\chapter{Results}

This chapter presents the results of the experiments conducted in this work. First, the type and number of memory units used in the RNN layers of the model architecture are evaluated. Second, the general performance of our day-level classifier is evaluated by comparing the time and episode metrics across different values of the post-processing threshold $T$. Lastly, the results of the daily pattern classifier (using the values selected in the preceding experiments) are compared in detail with those of the windowed eating classifier.

\section{Type and Quantity of Memory Units}
\label{mem_units}
A grid search was used to find the number of units $U$ in each layer and the type of layer (LSTM or GRU) that offered the best performance. Performance was measured using the episode true positive rate (TPR) and time weighted accuracy. Powers of 2 from 8 to 256 were used for the grid search since these are common in RNN design. The effect of the number of units and the type of unit on the episode TPR is shown in figure \ref{fig:rnn_units_tpr}. Similarly, the effect of these variables on the time weighted accuracy is shown in figure \ref{fig:rnn_units_wacc}. The metrics in these plots were evaluated with $T = 0.4$ for post-processing. 

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{img/RNN_epTPR_units.pdf}
\caption{Effect of the number of units in each layer on TPR compared between LSTM and GRU cells, $T = 0.4$.}
\label{fig:rnn_units_tpr}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{img/RNN_WAcc_units.pdf}
\caption{Effect of the number of units in each layer on time weighted accuracy compared between LSTM and GRU cells, $T = 0.4$.}
\label{fig:rnn_units_wacc}
\end{figure}

In both figures there is a maximum at 16 memory units for both the LSTM and the GRU, with a decline of 4--5\% as the number of memory units increases to 256. Fewer than 16 units also reduced performance slightly. In both episode TPR and time weighted accuracy, the GRU layers outperformed the LSTM layers by 1--2\% for every number of memory units except 128. GRUs have been shown to pick up on less prevalent patterns~\cite{gruber2020}, which may explain this slight performance difference. Furthermore, LSTMs tend to perform better when extensive, long-term context is required. In contrast, the data used with this classifier is relatively short (only hundreds of data points long) and 1-dimensional. Based on these results, a GRU layer with $U = 16$ units was chosen over an LSTM layer. All further results are reported with the parameter $U = 16$.

\section{Performance Analysis}

\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2138_d.pdf}
\caption{Daily pattern classifier, $P$($E_d$)}
\label{fig:low-response-ped-daily}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2138_w.pdf}
\caption{Windowed eating classifier, $P$($E_w$)}
\label{fig:low-response-ped-window}
\end{subfigure}
\caption{Comparison between $P$($E_d$) and $P$($E_w$) for a sample with high background noise. The daily pattern classifier shows a subdued response to an input of this type. GT shown with green bars.}
\label{fig:low-response-ped}
\end{figure}

Figure \ref{fig:low-response-ped-daily} depicts an example output from our day-level classifier for the input from the window-based classifier shown in figure \ref{fig:low-response-ped-window}. It can be seen that the background noise is much lower in $P$($E_d$) than in $P$($E_w$), demonstrating that the daily pattern classifier can mitigate many of the transient responses from the window-based eating classifier. The overall level of $P$($E_d$) is also lower than that of $P$($E_w$). Because of this, the threshold used for final segmentation on $P$($E_d$) needed to be much lower than both the threshold used for $P$($E_w$) and the standard cut-off of 0.5 used for a single-threshold approach.

A range of values for the threshold $T$ was tested. To evaluate these thresholds, the episode true positive rate (TPR), the number of false positives per true positive (FP/TP), and the time weighted accuracy (Acc$_W$) were computed at each value of $T$. The goal for daily pattern classifier performance was an episode TPR exceeding 85\%, to roughly match the performance reported in previous work~\cite{sharma2020}, with less than 1.0 FP/TP.
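
As a compact restatement of the two episode metrics, and assuming the usual definitions in terms of the numbers of true positive, false positive, and false negative episodes (the symbols $N_{TP}$, $N_{FP}$, and $N_{FN}$ are introduced here for illustration only):
\[
\mathrm{TPR} = \frac{N_{TP}}{N_{TP} + N_{FN}}, \qquad \mathrm{FP/TP} = \frac{N_{FP}}{N_{TP}}.
\]
For example, a single recording with 2 TP, 1 FP, and 1 FN episodes (the counts given for the daily pattern classifier in figure \ref{fig:model-comparison-fn}) would yield $\mathrm{TPR} = 2/3 \approx 67\%$ and $\mathrm{FP/TP} = 1/2 = 0.5$. Under these assumed definitions, the performance goal corresponds to $\mathrm{TPR} > 0.85$ and $\mathrm{FP/TP} < 1.0$.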

The effect of $T$ on the time weighted accuracy is shown in figure \ref{fig:threshold_wacc}. Initially, values of $T$ decreasing from 0.8 to 0.1 in steps of 0.05 were tested, since 0.8 was the value that offered the best performance with $P$($E_w$)~\cite{sharma2020}. For this classifier, however, $T = 0.8$ led to the worst performance, and the weighted accuracy continued to increase as $T$ approached 0.1. Values decreasing from 0.1 to 0.01 in steps of 0.01 were therefore tested as well. As depicted in figure \ref{fig:threshold_wacc}, Acc$_W$ reaches a maximum at $T = 0.05$ and then drops off sharply.

Likewise, the effect of $T$ on the episode TPR and FP/TP is shown in figure \ref{fig:threshold_tpr}. As $T$ decreases, both the episode TPR and the number of FP/TP increase. Values of $T$ between 0.2 and 0.05 lie in the optimal region known as the ``knee'' of the curve. However, the goal of less than 1.0 FP/TP limited $T$ to values over 0.08. Values of $T$ below 0.05, where the weighted accuracy in figure \ref{fig:threshold_wacc} peaks, exhibit a higher episode TPR, but at the expense of weighted accuracy. This means that at such low thresholds the model classifies large portions of the recording as eating. Ultimately, $T = 0.1$ was chosen to meet these objectives. All further results are reported using the threshold $T = 0.1$.
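
The trade-off described above can be made explicit with a simple sketch. Assume, for illustration only, that the final segmentation labels a sample at time $t$ as eating whenever the classifier output $p_t$ (a hypothetical shorthand for $P$($E_d$) at time $t$) reaches the threshold; the exact comparison and any additional post-processing steps are not restated here:
\[
\hat{y}_t(T) = 1 \ \ \mathrm{if}\ \ p_t \geq T, \qquad \hat{y}_t(T) = 0 \ \ \mathrm{otherwise}.
\]
Under this assumption, for any two thresholds $T_1 < T_2$,
\[
\{\, t : p_t \geq T_2 \,\} \subseteq \{\, t : p_t \geq T_1 \,\},
\]
so lowering $T$ can only enlarge the portion of the recording labeled as eating. This is consistent with the episode TPR and the number of FP/TP both rising as $T$ decreases.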

Figure \ref{fig:threshold_tpr} also shows the performance of the windowed eating model for various values of $T_S$ while $T_E$ was held constant at 0.3, as reported in~\cite{sharma2020}. The figure illustrates how close the performance of the two classifiers is, as well as the improvement in FP/TP discussed further in the next section.

\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{img/threshold_comparison_wacc.pdf} 
\caption{Effect of threshold $T$ on time weighted accuracy for daily pattern classifier.}
\label{fig:threshold_wacc}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{img/threshold_comparison_tpr.pdf}
\caption{Effect of threshold $T$ (number next to points) on episode TPR and FP/TP for the daily pattern model. The effect of threshold $T_S$ (number next to points) on the window-based classifier with $T_E = 0.3$ reported in~\cite{sharma2020} is also shown for reference. \FiveStar \hspace{2pt}, \textcolor{gray}{\scriptsize \SquareSolid} indicate selected values.}
\label{fig:threshold_tpr}
\end{figure}

\section{Comparison to Previous Work}

We compare our results to the CNN window-based eating classifier from previous work. All time evaluation metrics for the daily pattern classifier and the windowed eating classifier are shown in table \ref{tab:timeresults}. Similarly, episode metrics for these two classifiers are shown in table \ref{tab:episoderesults}. The results for the windowed eating classifier are those reported in previous work with $T_S = 0.8$ and $T_E = 0.4$~\cite{sharma2020}. A comparison between the results reported in~\cite{sharma2020} and those obtained from replicating the experiment is included in appendix \ref{app:replication}. When comparing the evaluation measures of the day-level classifier and the window-based eating classifier, we find that the time and episode metrics are comparable, while there is a drastic reduction in the number of FP/TP (i.e., false detections per true detection). With the chosen threshold of $T = 0.1$, the daily pattern classifier achieved an 85\% true positive rate (TPR) for eating episodes with only 0.8 FP/TP and a time weighted accuracy of 85\%. For the windowed eating model, an 89\% episode TPR with 1.7 FP/TP and a time weighted accuracy of 80\% were reported in previous work~\cite{sharma2020}. This amounts to a decrease of only 4 percentage points in episode TPR with the daily pattern classifier, but a 53\% relative decrease in the number of FP/TP at the chosen $T=0.1$. Moreover, the daily pattern classifier has lower FP/TP values than the windowed eating classifier at all comparable episode TPR values. Full evaluation results are shown in table \ref{app:table-daily-results} in appendix \ref{app:replication}.
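
The arithmetic behind these two figures, using only the rounded values listed in table \ref{tab:episoderesults}, is as follows. The change in episode TPR is an absolute difference, whereas the change in FP/TP is taken relative to the windowed eating classifier:
\[
89\% - 85\% = 4\ \mathrm{percentage\ points}, \qquad \frac{1.7 - 0.8}{1.7} \approx 0.53 = 53\%.
\]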

\begin{table}
\renewcommand\arraystretch{1.5}
\centering
\begin{tabular}{|l|c|c|c|c|c|}
\hline
\rowcolor[HTML]{EFEFEF} 
\textbf{Model}        & \textbf{TPR (\%)} & \textbf{TNR (\%)} & \textbf{F$_1$ (\%)} & \textbf{Precision (\%)} & \textbf{Acc$_W$ (\%)} \\ \hline
Window-based Classifier~\cite{sharma2020} & 69                & 93                & 48               & 36                      & 80                   \\ \hline
Daily Pattern Classifier   & 78                & 93                & 50               & 37                      & 85                   \\ \hline
\end{tabular}
\caption{Time metrics comparing windowed eating classifier and daily pattern classifier (this work). The latter shows an increase of 5 percentage points in Acc$_W$ and 9 percentage points in TPR, with other metrics remaining similar.}
\label{tab:timeresults}
\end{table}

\begin{table}
\renewcommand\arraystretch{1.5}
\centering
\begin{tabular}{|l|c|c|}
\hline
\rowcolor[HTML]{EFEFEF} 
\textbf{Model}         & \textbf{TPR (\%)} & \textbf{FP/TP} \\ \hline
Window-based Classifier~\cite{sharma2020} & 89                & 1.7            \\ \hline
Daily Pattern Classifier   & 85                & 0.8            \\ \hline
\end{tabular}
\caption{Eating episode metrics comparing windowed eating classifier and daily pattern classifier (this work). The latter shows a decrease of 4 percentage points in TPR, but a 53\% decrease in the number of FP/TP.}
\label{tab:episoderesults}
\end{table}

For further analysis, we consider a signal processing perspective. With our day-length $P$($E$) signal, figures \ref{fig:model-comparison1}--\ref{fig:model-comparison-fn} demonstrate that our daily pattern classifier produces a signal with a much higher signal-to-noise ratio (SNR) than the windowed eating model. This means that the meal peaks in the $P$($E_d$) signal are more distinguishable from the background noise. Conversely, the $P$($E_w$) signal for the window-based eating model has a much lower SNR, which necessitated the more complex hysteresis thresholding method. Even with this post-processing approach, the windowed eating classifier produced a much higher number of false detections. Our new day-level classifier improves considerably in this regard.

Figures \ref{fig:model-comparison1}--\ref{fig:model-comparison-fn} show direct comparisons between the output of the windowed eating classifier and the output of the daily pattern classifier. Figures \ref{fig:model-comparison1}--\ref{fig:model-comparison4} compare the $P$($E_d$) and the $P$($E_w$) output for the same individuals shown in section \ref{motivation}. In each, the ground truth events are shown with green bars and detections for the respective classifiers are shown above the GT with blue bars.

Figure \ref{fig:model-comparison1} shows a recording where the daily pattern classifier retains detection of all three eating episodes and reduces much of the background noise in the signal to zero. While there was no improvement in episode detection in this case, the overall output is visibly cleaner.

Figure \ref{fig:model-comparison2} shows average performance of the daily pattern classifier, with all three eating episodes detected and a 50\% reduction in the number of false detections. The only remaining false positive is triggered early in the recording, whereas the windowed eating classifier produced two transient false detections near the third strong peak. These brief responses likely indicate meal preparation/clean-up or light snacking before and after the last strong peak of the day in the $P$($E_w$). With our day-level classifier, they have been merged into the strongest of the three peaks to make up the last detection in the $P$($E_d$).

The recording shown in figure \ref{fig:model-comparison3} has a generally noisy $P$($E_w$) signal with several false detections throughout the day. The daily pattern classifier filtered out most of them, but it did not substantially reduce the peak around 16:55, which triggered a false detection in the $P$($E_d$). Even so, this recording exemplifies excellent performance by the daily pattern classifier, with an 80\% reduction in the number of false detections. Other than that one false detection, the $P$($E_d$) signal for this recording is clean and all three meals are detected.

Figure \ref{fig:model-comparison4} shows a $P$($E_w$) recording with abundant background noise in the first half and a noticeable period of rest in the second half. The $P$($E_d$) generated from this recording has remarkably low noise with a large difference between the peaks and the ambient signal. All three meals were detected and the number of false detections was reduced from 5 to 0, demonstrating a best-case scenario for the daily pattern classifier.

One final example in figure \ref{fig:model-comparison-fn} is included to show that our daily pattern classifier cannot recover eating episodes that are false negatives in the $P$($E_w$) signal. If an eating episode is missed by the windowed eating classifier and is consequently not recognizable as an eating episode in the $P$($E_w$) sequence, it will very likely not be recovered. In the case of figure \ref{fig:model-comparison-fn}, the last ground truth meal is not detected by either the windowed eating classifier or the daily pattern classifier. There are still 75\% fewer false detections, but the false negative episode is not recovered as a true positive by the daily pattern classifier. This specific $P$($E_w$) recording was very noisy, which produced a less confident model output with subdued peaks around 0.3 in the $P$($E_d$) signal. As a result, the last meal in the recording was suppressed and was not detected as a meal. A lower threshold $T$ might register this detection for this specific recording, but changing $T$ would not be advantageous on the whole.
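
The percentage reductions in false detections quoted for these example recordings follow directly from the FP counts listed in the subfigure captions. As a brief check, writing $\mathrm{FP}_w$ and $\mathrm{FP}_d$ for the number of false positives of the windowed eating classifier and the daily pattern classifier on a given recording (notation introduced here for illustration only), the relative reduction $(\mathrm{FP}_w - \mathrm{FP}_d)/\mathrm{FP}_w$ for figures \ref{fig:model-comparison2}, \ref{fig:model-comparison3}, \ref{fig:model-comparison4}, and \ref{fig:model-comparison-fn}, respectively, is
\[
\frac{2 - 1}{2} = 50\%, \qquad \frac{5 - 1}{5} = 80\%, \qquad \frac{5 - 0}{5} = 100\%, \qquad \frac{4 - 1}{4} = 75\%.
\]
For figure \ref{fig:model-comparison1}, both classifiers produce 0 FP, so the relative reduction is undefined and no percentage is quoted.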
\begin{figure}[h!]
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2193_d.pdf}
\caption{Daily pattern classifier $P$($E_d$) with $T = 0.1$. 3 TP, 0 FP, 0 FN}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2193_w.pdf}
\caption{Windowed eating classifier $P$($E_w$) with $T_S = 0.8$ and $T_E = 0.4$. 3 TP, 0 FP, 0 FN}
\end{subfigure}
\caption{Comparison between $P$($E_d$) and $P$($E_w$) showing noise reduction, but no marked improvement on a $P$($E_w$) recording with low background noise. Detections shown with blue bars (top) and GT shown with green bars (bottom).}
\label{fig:model-comparison1}
\end{figure}
\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2356_d.pdf}
\caption{Daily pattern classifier $P$($E_d$) with $T = 0.1$. 3 TP, 1 FP, 0 FN}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2356_w.pdf}
\caption{Windowed eating classifier $P$($E_w$) with $T_S = 0.8$ and $T_E = 0.4$.  3 TP, 2 FP, 0 FN}
\end{subfigure}
\caption{Comparison between $P$($E_d$) and $P$($E_w$) showing decent performance of the daily pattern classifier with a 50\% reduction in the number of false positives in the $P$($E_d$). Detections shown with blue bars (top) and GT shown with green bars (bottom).}
\label{fig:model-comparison2}
\end{figure}

\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2483_d.pdf}
\caption{Daily pattern classifier $P$($E_d$) with $T = 0.1$. 3 TP, 1 FP, 0 FN}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2483_w.pdf}
\caption{Windowed eating classifier $P$($E_w$) with $T_S = 0.8$ and $T_E = 0.4$. 3 TP, 5 FP, 0 FN}
\end{subfigure}
\caption{Comparison between $P$($E_d$) and $P$($E_w$) showing an 80\% decrease in the number of false positives in the $P$($E_d$). Detections shown with blue bars (top) and GT shown with green bars (bottom).}
\label{fig:model-comparison3}
\end{figure}

\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2236_d.pdf}
\caption{Daily pattern classifier $P$($E_d$) with $T = 0.1$. 3 TP, 0 FP, 0 FN}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2236_w.pdf}
\caption{Windowed eating classifier $P$($E_w$) with $T_S = 0.8$ and $T_E = 0.4$. 3 TP, 5 FP, 0 FN}
\end{subfigure}
\caption{Comparison between $P$($E_d$) and $P$($E_w$) showing a 100\% reduction in the number of false positives in $P$($E_d$) from a $P$($E_w$) recording with high noise. Detections shown with blue bars (top) and GT shown with green bars (bottom).}
\label{fig:model-comparison4}
\end{figure}

\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2138_d_detect.pdf}
\caption{Daily pattern classifier $P$($E_d$) with $T = 0.1$. 2 TP, 1 FP, 1 FN}
\end{subfigure}
\\[12pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/model_comparison_P2138_w_detect.pdf}
\caption{Windowed eating classifier $P$($E_w$) with $T_S = 0.8$ and $T_E = 0.4$. 2 TP, 4 FP, 1 FN}
\end{subfigure}
\caption{Comparison between $P$($E_d$) and $P$($E_w$) where an FN episode (last ground truth event) is not recovered. Detections shown with blue bars (top) and GT shown with green bars (bottom).}
\label{fig:model-comparison-fn}
\end{figure}
