\counterwithin{figure}{section}
\counterwithin{table}{section}

\section{Quantifying Model Volatility}
\label{app:model-vol}

While conducting initial experiments for this work, we observed that the evaluation metrics changed drastically each time the windowed eating classifier was retrained, even when all other variables (e.g., random number generators) were held constant. This issue arose because the windowed eating classifier was trained to optimize per-datum accuracy, where each datum was a 6-minute window of IMU data consisting of 5400 total data points from 6 axes. As a result, eating \textit{episode} detection accuracy varied between retrainings.

It is important to note that the thresholding algorithm used for the windowed eating classifier differs from that used with the daily pattern classifier. Instead of a single threshold, it uses a dual-threshold hysteresis method combined with other heuristics. The two thresholds are the start threshold, $T_S$, and the end threshold, $T_E$. When the P(E) signal output by the model exceeded $T_S$, the start of an eating detection was marked. Similarly, when the probability signal fell below $T_E$, the end of an eating detection was marked. Based on previous work, $T_S$ was set to 0.8 and $T_E$ was set to 0.4~\cite{sharma2020}. As for the other heuristics, detections within half a window length of each other were merged, and detections shorter than 1 minute were discarded. The resulting sequence of detections was recorded as a binary array used in the subsequent time and episode metric evaluation.
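
As a sketch of the hysteresis rule described above (not necessarily the exact implementation), let $p_t$ denote the P(E) value at time index $t$ and let $d_t \in \{0, 1\}$ denote the detection state. The symbols $p_t$ and $d_t$, and the initial state $d_0 = 0$, are notation assumed here for illustration. With $T_S > T_E$, the state update can be written as
\[
d_t = \left\{
\begin{array}{ll}
1, & p_t > T_S, \\
0, & p_t < T_E, \\
d_{t-1}, & \mbox{otherwise.}
\end{array}
\right.
\]
A detection therefore begins only when $p_t$ rises above $T_S$ and persists until $p_t$ drops below $T_E$. A P(E) signal that peaks just under $T_S$ produces no detection at all, however long it stays above $T_E$. The merging and minimum-duration heuristics would then be applied to the resulting sequence $d_t$.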

We hypothesize that this variability arises because the model was trained to optimize per-window accuracy (and, through class balancing during training, \textit{weighted} accuracy), whereas the hyperparameters $T_S$ and $T_E$ used in the hysteresis thresholding method for eating episode detection were manually tuned to optimize per-episode accuracy. This left the model at a carefully balanced tipping point where any deviation in model training could produce a considerable difference in eating episode detection and, consequently, in the time and episode metrics. Figure~\ref{fig:model-volatility-example} shows an example of how this could happen. If the model trained differently due to the model architecture, error surface, and gradient descent process, the probability of eating output by the model for a specific eating event could vary enough to no longer exceed $T_S$ in the hysteresis thresholding method. Since the output of hysteresis thresholding is used to create the sequences compared for both time and episode metrics, this would adversely affect both. For example, as shown in Figure~\ref{fig:volatility-metrics-comparison}, if an interval were classified as eating by one model because it narrowly exceeded the start threshold but fell below that threshold under another model, the episode metrics would vary drastically.

\begin{figure}
\centering
\includegraphics[width=0.65\textwidth]{img/model_volatility_example.pdf}
\caption{Example of two P(E) results from the windowed eating model with hysteresis thresholds, $T_S$ and $T_E$. The solid line P(E) would trigger an eating episode detection with the provided $T_S$ and $T_E$, while the dotted line P(E) would not.}
\label{fig:model-volatility-example}
\end{figure}

\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth]{img/window_vs_episode_loss_graphic1.pdf}
\caption{All three meals detected and most windows correctly classified. Window TPR = 70\%, episode TPR = 100\%.}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth]{img/window_vs_episode_loss_graphic2.pdf}
\caption{All three meals detected, but only one window correctly classified in the second meal. Window TPR = 60\%, episode TPR = 100\%.}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth]{img/window_vs_episode_loss_graphic3.pdf}
\caption{Second meal missed entirely, so only two meals detected. Window TPR = 50\%, episode TPR = 70\%.}
\end{subfigure}
\caption{Three examples showing how window and episode metrics change as the amount of eating episode detection varies.}
\label{fig:volatility-metrics-comparison}
\end{figure}

To measure the volatility, the model was trained 30 times with 5-fold cross validation, and several performance metrics were evaluated for each of the 354 recordings in the CAD dataset. These metrics were: window weighted accuracy (Acc$_W$), time F$_1$ score, time true positive rate (TPR), time weighted accuracy (Acc$_W$), episode F$_1$ score, and episode TPR. The formulas for these metrics are defined in Section~\ref{metrics}. After each retraining, the metrics were calculated on a subject-by-subject basis: for example, a trained model was used to calculate the time F$_1$ score for subject 1, then subject 2, and so on. The standard deviation of each metric across the 30 retrainings was then calculated per subject, and these standard deviations were averaged over all subjects to give a final measure of fluctuation (``volatility''). These results are summarized in Table~\ref{tab:model-volatility-results}.
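
In symbols, the volatility measure described above can be sketched as follows. The notation is introduced here for illustration only: $m_{s,r}$ is the value of a given metric for subject $s$ after retraining $r$, $R = 30$ is the number of retrainings, and $N$ is the number of subjects. The population form of the standard deviation is shown as an assumption; a sample form with $R - 1$ in the denominator would be equally consistent with the description.
\[
\bar{m}_s = \frac{1}{R} \sum_{r=1}^{R} m_{s,r}, \qquad
\sigma_s = \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left( m_{s,r} - \bar{m}_s \right)^2}, \qquad
V = \frac{1}{N} \sum_{s=1}^{N} \sigma_s .
\]
Here $V$ is the volatility of the metric. Because $\sigma_s$ is computed within each subject before averaging, $V$ reflects variation between retrainings and not differences in difficulty between subjects.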

\begin{table}
\renewcommand\arraystretch{1.5}
\centering
\begin{tabular}{|c|c|}
\hline
\rowcolor[HTML]{EFEFEF} 
Metric           & \begin{tabular}[c]{@{}c@{}}Average standard deviation \\ per subject {[}\%{]}\end{tabular} \\ \hline
Window Acc$_W$        & 3.4                                                                                        \\ \hline
Time Acc$_W$        & 4.7                                                                                        \\ \hline
Time TPR         & 9.3                                                                                        \\ \hline
Time F$_1$ Score    & 9.1                                                                                        \\ \hline
Episode TPR      & 9.5                                                                                        \\ \hline
Episode F$_1$ Score & 11.4                                                                                       \\ \hline
\end{tabular}
\caption{Volatility of the windowed eating model's performance metrics across 30 retrainings.}
\label{tab:model-volatility-results}
\end{table}

As seen in Table~\ref{tab:model-volatility-results}, the volatility of time weighted accuracy (4.7\%) was roughly half that of the other time and episode performance measures (9.1\% to 11.4\%). Time and episode TPR and F$_1$ serve to explore the volatility in the other metrics that are evaluated for this dataset. Apart from weighted accuracy, the volatility was higher than expected, at around 10\% regardless of the metric. This supports our hypothesis that the number of true positive eating episode detections is to blame, since weighted accuracy is the only metric in this collection that considers all TPs, TNs, FPs, and FNs. In summary, the model has been optimized to a somewhat precarious point, where small changes during the stochastic training process cause widely differing results.

The way the model is trained also appears to contribute further volatility. Window Acc$_W$ variability was measured at 3.4\%, which is much lower than that of the other metrics but still higher than expected. We hypothesize that this per-subject volatility in window weighted accuracy arises from how non-eating and eating samples are selected for training, which leads to highly variable training and therefore to volatility in the time and episode metrics.