\chapter{Introduction}

\section{Overview}
% What is the problem?
This thesis considers the problem of detecting when a person is eating during everyday life by tracking their wrist motion using sensors found in a typical smartwatch. The motivation for this work is to help people track their energy intake, which is critical for weight loss and treating obesity. The following paragraphs outline the background for this work, which is covered in more detail in the rest of this chapter.

Obesity is an extremely challenging and widespread global health problem. Obesity can develop when energy intake (food consumption) is consistently higher than energy expenditure. Energy expenditure overall is much lower in the modern era since sedentary lifestyles are common and transportation is largely motorized. Thus, tracking energy intake is a valuable tool in fighting overweight and obesity. Traditionally, this is done with self-monitoring tools like food diaries, but these methods are tedious and subject to self-reporting bias and inaccuracy. A system that could automatically monitor energy intake would be more advantageous and effective. 

Our group has been investigating methods that use wrist motion tracking to detect when a person eats~\cite{dong2009, dong2011, dong2013, luktuke2020, sharma2020, wei2021}. Previous work used a Bayesian classifier to segment episodes of eating from data collected with wrist-worn smartphones~\cite{dong2013}. Later work used a convolutional neural network and a sliding window approach that improved eating detection accuracy by almost 10\%~\cite{sharma2020}. A limitation of both approaches is that they examined small windows of time (on the order of minutes) to determine whether eating was occurring within each window. \newpage This thesis considers the possibility that additional context exists in a longer window, specifically an entire day, that can improve recognition accuracy. Figure \ref{fig:contextoverview} illustrates our motivating principle. People tend to eat 3--4 meals a day at regularly spaced intervals. People also tend to eat soon after a long period of rest (sleep). These patterns, as well as others in daily activities, could potentially be learned by a new classifier.

The remainder of this introductory chapter provides the background and context for this thesis. Section \ref{obesity} explores the scale and causes of the obesity epidemic facing much of the world population. Sections \ref{mhealth} and \ref{adm} discuss the field of mHealth and how it can impact personal health, along with the concept of automated dietary monitoring (ADM). Section \ref{shimmer} briefly describes the wrist-worn motion tracking devices used to collect the original eating data. Section \ref{nn} provides background on neural networks and deep learning, including the recurrent neural networks used in this work to analyze the day-to-day patterns of eating behavior. Section \ref{related-work} reviews other relevant work in the field of eating detection and energy intake monitoring. Section \ref{motivation} then summarizes the motivation for this work and describes the use of daily contextual indicators for segmenting eating episodes. Finally, section \ref{novelty} outlines the novelty of this thesis and the questions this work strives to answer.

\begin{figure}
\centering
\includegraphics[width=\textwidth]{img/context_comparison.pdf}
\caption{Comparison of daily context (24 hours) and window context (6 minutes) to scale.}
\label{fig:contextoverview}
\end{figure}

\section{Background}
% Why is it a problem?
\subsection{Obesity}
\label{obesity}

Obesity is not just a challenging health problem; it is an epidemic. In 2016, over 1.9 billion adults (18 and older) worldwide were overweight, and of these over 650 million were obese~\cite{who2021}. In the United States, 42.4\% of people aged 20 and over are obese, meaning a body mass index (BMI) [kg/m$^2$] of 30 or higher, according to a 2017--2018 report from the Centers for Disease Control and Prevention (CDC)~\cite{cdc2020}. It is estimated that by 2030, most of the American population will qualify as overweight and nearly half the population will be obese~\cite{wang2020}. Although many researchers debate the validity of BMI for measuring body fat percentage and obesity~\cite{freedman2009, nuttall2015, rothman2008}, the relative increase in these measures is alarming.

Worldwide obesity has nearly tripled since 1975 and childhood obesity (ages 5--19) has increased ten-fold in that same time~\cite{who2018worldstats, who2021}. This dramatic growth is attributed to poor diets high in fat and sugar, increased inactivity, and widespread sedentary lifestyles. Obesity is associated with increased risk of cardiovascular diseases (heart disease and stroke), diabetes, and some cancers, among other diseases and health conditions. More specifically, there is sufficient evidence of increased risk of colon, kidney, liver, and pancreatic cancer with obesity~\cite{lauby2016}. The recent COVID-19 pandemic has increased sedentary time, reduced physical activity, and worsened eating habits, further amplifying the conditions that lead to unhealthy weight gain and obesity~\cite{narici2020impact}.

Nevertheless, obesity is almost entirely preventable~\cite{who2021}. It comes down to balancing energy intake and energy expenditure, both commonly measured in the unit of calories. If energy intake is greater than energy expenditure, the human body will store energy in the form of fat as a physiological response anticipating future food scarcity. If an imbalance exists, energy intake and energy expenditure could become balanced through greater energy expenditure in the form of physical activity or exercise. However, this is not as efficient or effective for weight loss as reducing energy intake~\cite{tataranni2003}. Fundamentally, the easiest way to maintain a healthy energy balance is to reduce excessive energy intake and overconsumption of food. Although this may sound simple, drastic lifestyle changes are required if overconsumption has been the status quo for some time.
\subsection{mHealth}
\label{mhealth}

Worldwide health and life expectancy have greatly improved in the last century. As a result, the most common causes of death have transitioned to chronic conditions and noncommunicable diseases (NCDs), or diseases that are not directly transmissible between people. According to a 2020 report by the World Health Organization (WHO), over 70\% of deaths worldwide are caused by NCDs including cardiovascular diseases, cancer, and diabetes~\cite{who2020ncd}. Most, although not all, of these NCDs are associated with certain lifestyle risk factors like unhealthy diet, overconsumption, and a deficiency of physical activity. This means that many NCDs are entirely preventable causes of death.
\newpage
Good personal health is paramount in reducing the risk of these diseases. Consequently, healthcare overall will need to shift from a paradigm of reactive care to one of proactive care. The rise of personal electronic devices like smartphones and wearables like fitness trackers and smartwatches has ushered in a new age of possibility for health tracking to assist with this shift. Using mobile devices and especially wearable devices to monitor personal health is known as mobile health or mHealth. And although electronic devices have contributed to a more sedentary lifestyle in some ways, the field of mHealth also offers ways for them to improve personal health in the battle against overweight and obesity. 

\begin{figure}
\centering
\includegraphics[width=0.45\textwidth]{img/apple_watch.jpeg}
\caption{Apple Watch Series 6 smartwatch and fitness tracker~\cite{applewatch}.}
\label{fig:applewatch}
\end{figure}

The Apple Watch and Fitbit are among the most popular fitness trackers of the past 8 years~\cite{wearables2021}. These devices make tracking personal health and weight management approachable and straightforward. For example, the latest Apple Watch (shown in figure \ref{fig:applewatch}) is equipped with several biometric sensors including a heart rate monitor, an FDA-approved electrocardiogram (ECG), and a blood oxygen sensor~\cite{applewatch}. The device uses these sensors to track caloric energy expenditure during workouts, resting heart rate over time, irregular heartbeat, and much more. With the latest software, this data can even be securely shared with an individual's doctor or physician if the user chooses to do so~\cite{watchos}. This is an example of the shift to proactive healthcare that is needed to detect conditions before they become life threatening or even before they develop. Nonetheless, for weight loss, the smartwatches and fitness trackers on the market today only monitor energy expenditure.

\subsection{Automatic Dietary Monitoring}
\label{adm}

There are several ways to monitor energy intake. Perhaps the easiest is self-monitoring, where an individual records what foods they eat throughout the day and when they consume them. This may take the form of paper or electronic food diaries or calorie-logging smartphone apps. A literature review of 15 studies focused on dietary self-monitoring found a significant relationship between self-monitoring and weight loss~\cite{burke2011}. The studies used a mixture of paper and electronic food diaries. The review noted that the completeness of self-monitoring decreased as the studies progressed, and that individuals who maintained higher self-monitoring compliance throughout the studies experienced greater weight loss. This research suggests that a barrier to effective weight loss may be the difficulty and tedium of maintaining dietary self-monitoring over a long period of time.

Furthermore, self-monitoring approaches are subject to human error and bias, as they rely on an individual's ability to accurately report, or worse, recall, what and how much they ate. According to one study, dietitians and non-dietitians alike underreport the number of calories they consume in a day~\cite{champagne2002}. The goal of the field of automated dietary monitoring (ADM) is to log energy intake automatically so that it can be tracked accurately~\cite{amft2005}. Based on previous studies~\cite{burke2011}, this could aid individuals in weight loss and decrease the prevalence of obesity. Beyond supporting weight loss, ADM could also help elderly individuals, especially those with Alzheimer's disease or dementia, maintain an accurate record of their meals.

Numerous devices already exist to accurately track energy expenditure, but a wearable device that can accurately track energy intake does not yet exist. We believe this is an important piece of the puzzle. The goal of our research group is to develop a system that can be used to monitor food intake. Previous work has shown that it is possible to track ingestion events (bites) with wrist-worn devices~\cite{dong2009}. However, these systems are prone to false detections from activities that resemble eating, such as brushing teeth or self-grooming~\cite{sharma2020}. The goal of this work and others (see section \ref{related-work}) is to automatically recognize when eating is occurring so that the more sensitive bite counting systems are triggered only when needed. This could be used to accurately estimate the amount of food consumed by an individual in their everyday life and help with weight loss, eating disorders, and even eating speed~\cite{dong2009}.

\begin{figure}
\centering
\includegraphics[width=0.58\textwidth]{img/Shimmer3_Device.png}
\caption{Shimmer3 wearable IMU sensor device.}
\label{fig:shimmer}
\end{figure}

\subsection{Shimmer3 Device}
\label{shimmer}

The wrist motion data for the CAD dataset was recorded on Shimmer3 devices~\cite{cad2020}. This device, shown in figure \ref{fig:shimmer}, is a wearable piece of hardware that contains an inertial measurement unit (IMU) to measure and track motion. The IMU in each device was equipped with an accelerometer, gyroscope, and magnetometer (not used), to record a total of 6 axes of motion: linear acceleration along, and angular velocity about, the $x$, $y$, and $z$ axes.

The large, circular, orange button on the front of the device was used as a way for the participant to record when they started and stopped eating. The LED lights on the front indicate the status of the device as well as different operational modes. The LEDs flashed in a regular pattern to notify the individual it was actively recording data. Data was recorded to an onboard microSD card. When data collection was completed for a day, the device was connected to a dock via the port on the bottom of the device (not shown). Data was imported from the device using the Shimmer Consensys software installed on a computer and then exported to a CSV file for subsequent processing. The entire procedure of processing the data for the CAD dataset is included in~\cite{sharma2020}.

% introduction to the benefits of neural networks
\subsection{Neural Networks}
\label{nn}

% what is a neural network?
Neural networks are part of the broad field of artificial intelligence (AI), which has grown rapidly in the past decade. At a high level, an artificial neural network is a collection of neurons, or nodes, each of which takes an input, applies a node function, and produces an output. Each node has an additive bias, and the connections between nodes carry multiplicative weights. A neural network is defined by these parameters (weights and biases) and the connections between nodes. A collection of nodes that receive the same input, feed the same output, and perform the same function is known as a layer.
% what is an activation function?
The node function mentioned previously is more commonly known as an activation function. Activation functions control the output of a node. There are many common activation functions used in neural networks, but we will only introduce the rectified linear unit (ReLU), tanh, and sigmoid functions since they are used in this work. The ReLU activation function is defined as $\text{ReLU}(x) = \max(0, x)$, so all values less than 0 are clipped to 0. The tanh activation function is simply the hyperbolic tangent function, i.e., the hyperbolic analogue of the circular tangent function used in trigonometry, $\tanh(x) = \dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, and produces an output in the range from $-1$ to 1. The sigmoid activation function produces an output in the range from 0 to 1. It is defined as $\sigma(x) = \dfrac{1}{1+e^{-x}}$. Plots of all three activation functions (ReLU, tanh, and sigmoid) are shown in figure \ref{fig:activation-funcs}.
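
As a quick numerical check of these definitions (the sample inputs are chosen here purely for illustration), evaluating each function at $x = -2$, $0$, and $2$ gives
\begin{align*}
\text{ReLU}(-2) &= 0, & \text{ReLU}(0) &= 0, & \text{ReLU}(2) &= 2, \\
\tanh(-2) &\approx -0.964, & \tanh(0) &= 0, & \tanh(2) &\approx 0.964, \\
\sigma(-2) &\approx 0.119, & \sigma(0) &= 0.5, & \sigma(2) &\approx 0.881.
\end{align*}
The tanh and sigmoid functions are also related by a shift and rescaling, $\tanh(x) = 2\sigma(2x) - 1$, which follows directly from the two definitions. In other words, tanh can be viewed as a sigmoid stretched to the range $(-1, 1)$ and centered at zero.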

\begin{figure}
\centering
\includegraphics[width=0.62\textwidth]{img/activation_functions.pdf}
\caption{Rectified linear unit (ReLU), sigmoid, and tanh activation functions.}
\label{fig:activation-funcs}
\end{figure}

% what is training?
In order for a neural network to ``learn'' the best parameters for a specific problem, it undergoes training. For training, the network is initialized in some manner and then the input is provided. The network uses the weights and biases of the nodes to compute an output and this output is scored for accuracy. The accuracy of the output is used to determine how much the weights and biases should be adjusted. A complete cycle of training with all of the training data is called an epoch.

There are three types of training: supervised, unsupervised, and reinforcement learning. In supervised training, the data processed by the network is labeled with a correct set of outputs known as the ground truth (GT). The network processes the input to generate an output and directly compares it to the desired result. Supervised learning is typically used for classification or regression tasks. When using supervised training with a neural network, it is important that the network is not trained and tested on the same data. If this were to occur, the model would likely achieve high accuracy but not transfer well to other data, as it could simply ``memorize'' the data it was trained on. Unsupervised training is performed with data that has no ground truth labels. This technique is used when ground truth data is not available or when the goal is to reveal latent features that are not apparent from the raw data. Lastly, reinforcement learning does not use labeled training data, but instead associates rewards or penalties with different actions. For example, reinforcement learning is used when training neural networks to play video games.

% what is a loss function?
The error between the network output and the desired output is known as the loss, and the function that quantifies it is the loss function. In other words, the loss function indicates how well the model performs with its current parameters. There are many different loss functions used in neural networks, but we focus only on those used in this thesis. Others are beyond the scope of this work and are covered extensively in the deep learning literature.

There are two broad classes of loss functions: regression and classification. Regression loss functions measure the error when the network predicts continuous values, while classification loss functions measure the error when it predicts a discrete value like a category. For example, mean absolute error (MAE) and mean squared error (MSE) are regression loss functions. MAE is calculated as the average absolute value of the difference between the predictions ($x_i$) and ground truth values ($y_i$), and MSE is calculated as the average square of this difference (see equations \ref{eqn:mae} and \ref{eqn:mse}). Classification loss functions include categorical cross-entropy and binary cross-entropy (discussed later in chapter \ref{methods}).
\begin{equation}
\label{eqn:mae}
\text{MAE} = \frac{1}{n} \sum_{i=1}^n \lvert y_i - x_i\rvert
\end{equation}
\begin{equation}
\label{eqn:mse}
\text{MSE} = \frac{1}{n}\sum_{i=1}^n \left(y_i - x_i\right)^2
\end{equation}
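As a small worked example (the values are assumed here for illustration and are not taken from the data used in this thesis), let the ground truth be $y = (1, 0, 1, 1)$ and the predictions be $x = (0.9, 0.2, 0.6, 1.0)$. The absolute differences are $0.1$, $0.2$, $0.4$, and $0$, and their squares are $0.01$, $0.04$, $0.16$, and $0$. Because the differences are squared, MSE weights large errors more heavily than MAE: the largest difference contributes $0.4/0.7 \approx 57\%$ of the MAE sum but $0.16/0.21 \approx 76\%$ of the MSE sum. The two losses evaluate to
\begin{align*}
\text{MAE} &= \frac{0.1 + 0.2 + 0.4 + 0}{4} = 0.175, \\
\text{MSE} &= \frac{0.01 + 0.04 + 0.16 + 0}{4} = 0.0525.
\end{align*}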
\begin{figure}
\centering
\includegraphics[width=0.68\textwidth]{img/gradient_descent.png}
\caption{Example of gradient descent on a 3D surface with 3 different starting points.}
\label{fig:gd}
\end{figure}
% what is gradient descent?

Gradient descent is used to find the best set of parameters for a neural network. It iteratively computes the gradient of the error with respect to the parameters and takes small steps along the error surface toward a local or global minimum (see figure \ref{fig:gd}). The direction of each step is the negative gradient of the error, since the positive gradient points in the direction of the steepest \textit{increase} in error and the goal is to reduce the error so that the output most closely matches the correct output. The size of each step is controlled by a parameter known as the learning rate $\eta$, a small number $0 < \eta < 1$. The basic weight update equation for gradient descent is shown in equation \ref{eqn:gd}, where $w$ is a matrix of the network weights, $\eta$ is the learning rate, and $\nabla E[w]$ is the gradient of the error with respect to the weights.
\begin{equation}
\label{eqn:gd}
w = w-\eta \nabla E[w]
\end{equation}
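As a worked illustration (a toy one-dimensional error function assumed here, not one used in this work), take $E[w] = (w-3)^2$, whose gradient is $\nabla E[w] = 2(w-3)$ and whose minimum lies at $w = 3$. Starting from $w = 0$ with $\eta = 0.1$, two applications of equation \ref{eqn:gd} give
\begin{align*}
w &= 0 - 0.1 \cdot 2(0 - 3) = 0.6, \\
w &= 0.6 - 0.1 \cdot 2(0.6 - 3) = 1.08.
\end{align*}
In this example each step shrinks the distance to the minimum by the factor $1 - 2\eta = 0.8$, so a smaller $\eta$ converges more slowly, while $\eta > 1$ would make the distance grow with every step.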
There are three variants of gradient descent: batch, stochastic, and mini-batch. Standard (or batch) gradient descent updates the weights only after all of the training samples have been considered, summing the error over every sample. Stochastic gradient descent updates the weights after each individual training sample, and mini-batch gradient descent strikes a balance between the two: it computes gradients and updates weights after each group, or batch, of training samples. Mini-batch gradient descent is generally preferred because it reduces compute time and gives a good approximation of the overall gradient while mitigating noise in the data.

A simple fully-connected or dense neural network is shown in figure \ref{fig:dnn}. A fully-connected network has an input layer, output layer, and one or more hidden layers. This is perhaps the most basic form of a neural network.

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{img/neural_network.pdf}
\caption{Simple fully-connected neural network architecture with one hidden layer.}
\label{fig:dnn}
\end{figure}

% what is deep learning?
Deep learning is a subset of machine learning and artificial intelligence that uses neural networks with many hidden layers known as deep neural networks. Training deep neural networks makes use of a technique called backpropagation to update the weights in the network. Backpropagation is needed since the desired values are not known for intermediate hidden layers in the network.
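
For intuition, consider a minimal sketch with assumed notation: a network with one scalar input $x$, one hidden node $h = f(z_1)$ with $z_1 = w_1 x + b_1$, and one output $\hat{y} = g(z_2)$ with $z_2 = w_2 h + b_2$, trained with the error $E = \frac{1}{2}(\hat{y} - y)^2$. The chain rule gives
\begin{align*}
\frac{\partial E}{\partial w_2} &= (\hat{y} - y)\, g'(z_2)\, h, \\
\frac{\partial E}{\partial w_1} &= (\hat{y} - y)\, g'(z_2)\, w_2\, f'(z_1)\, x.
\end{align*}
The factor $(\hat{y} - y)\, g'(z_2)$ computed at the output is reused in the hidden-layer gradient, which is the sense in which the error is propagated backward through the network.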

% what types of networks are there?
There are several different types of neural networks. First, there is the fully-connected neural network described earlier. In this type of network, every node in one layer is connected to every node in the adjacent layers (see figure \ref{fig:dnn}). Fully-connected networks can be made ``deeper'' by increasing the number of hidden layers. This concept applies to other types of neural networks as well.

A convolutional neural network (CNN) is a type of artificial neural network that convolves its input with filters learned during training to generate an output. CNNs are most common in image and speech processing, but they apply to one-dimensional (1D), two-dimensional (2D), and higher-dimensional problems alike. Since the convolutional filters, or kernels, that move across the input use shared weights, these networks usually require drastically fewer parameters than fully-connected networks. There are also common, specialized architectures that combine these techniques with other notable approaches like residual connections, as in U-Net~\cite{unet}.
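
To make the effect of weight sharing concrete, here is a rough count under assumed, illustrative sizes (these are not the configurations used in this work). A 1D convolutional layer with kernel length $k$, $C_\text{in}$ input channels, and $C_\text{out}$ filters computes, for each filter $f$ and output position $t$,
\[
y_f[t] = b_f + \sum_{c=1}^{C_\text{in}} \sum_{j=0}^{k-1} w_f[j, c]\, x_c[t + j],
\]
and therefore has $(k\, C_\text{in} + 1)\, C_\text{out}$ parameters regardless of the input length. With $k = 5$, $C_\text{in} = 6$, and $C_\text{out} = 16$, this is $(5 \cdot 6 + 1) \cdot 16 = 496$ parameters. By comparison, a fully-connected layer that maps a flattened input of 100 timesteps by 6 channels to only 16 output nodes needs $(600 + 1) \cdot 16 = 9616$ parameters, and that count grows with the input length.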

For this work, the neural networks were built with TensorFlow, an open source machine learning platform for Python released by Google in 2015~\cite{tensorflow2015}. Both TensorFlow and the high-level Keras application programming interface (API) built into TensorFlow were used.

\subsection{Recurrent Neural Networks}

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{img/rnn.pdf}
\caption{Simple recurrent neural network with inputs $x_t$, outputs $y_t$, hidden states $h_t$, and activations $a_t$ for each timestep $t$.}
\label{fig:rnn}
\end{figure}

A recurrent neural network (RNN) is a neural network in which the neurons are arranged sequentially and the input is processed one timestep at a time. Each neuron has an input and an output like a normal neuron, but also a hidden state. The hidden state from the previous neuron is used as an input to the next neuron. A diagram that depicts the relationship between nodes in an RNN is shown in figure \ref{fig:rnn}. There are several categories of RNN. A one-to-many RNN has one timestep in the input and many timesteps in the output. Many-to-one and many-to-many RNNs are named accordingly. The type of RNN used in this work is a many-to-many RNN (shown in figure \ref{fig:rnn}), where both the input and output have multiple timesteps, a paradigm commonly used for machine translation. Moreover, an RNN can be bi-directional (shown in figure \ref{fig:brnn}). In this case, a forward pass over the input is completed first and then the sequence is reversed for a backward pass, which allows the RNN to learn attributes of the sequence in both directions. It is also important to note that training an RNN requires backpropagation through time (BPTT) rather than vanilla backpropagation; BPTT can be loosely explained as backpropagation through each of the timesteps of the network.

The weights and biases in an RNN are shared among all timesteps. This means that increasing the length of the input does not increase the number of parameters in the network. By sharing weights, the number of parameters is drastically reduced and the backpropagation process is simplified. 
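
One common formulation of a simple RNN, written with the same symbols as the gate formula in equation \ref{eqn:rnn-gates} (the output weights $V$ and bias $c$ are notation assumed for this sketch), is
\begin{align*}
h_t &= \tanh(W x_t + U h_{t-1} + b), \\
y_t &= \sigma(V h_t + c).
\end{align*}
For an input of dimension $d$ and $n$ hidden units, $W$ is $n \times d$, $U$ is $n \times n$, and $b$ has $n$ entries, so the recurrent layer has $n(d + n + 1)$ parameters. With $d = 6$ (the six IMU axes) and an assumed $n = 32$, this is $32 \cdot 39 = 1248$ parameters whether the sequence has 10 timesteps or 10{,}000.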

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{img/bidirectional_rnn.pdf}
\caption{Bidirectional recurrent neural network with inputs $x_t$, outputs $y_t$, and activations $a_t$ for each timestep $t$.}
\label{fig:brnn}
\end{figure}

\begin{figure}
\centering
\begin{subfigure}{0.9\textwidth}
\includegraphics[width=\textwidth]{img/lstm_diagram.pdf}
\caption{Long short-term memory (LSTM) cell with memory cell $C$, input gate $\Gamma_i$, output gate $\Gamma_o$, and forget gate $\Gamma_f$.}
\label{fig:lstm}
\end{subfigure}
\\
\vspace{1.5cm}
\begin{subfigure}{0.9\textwidth}
\includegraphics[width=\textwidth]{img/gru_diagram.pdf}
\caption{Gated recurrent unit (GRU) with hidden state $h$, update gate $\Gamma_z$ and reset gate $\Gamma_r$.}
\label{fig:gru}
\end{subfigure}

\caption{Types of recurrent neural network neurons with input $x_t$, hidden state $h_t$, cell state $c_t$, and output $y_t$.}
\end{figure}

Two common types of neurons used in RNNs are long short-term memory (LSTM) units and gated recurrent units (GRUs). An LSTM cell is a special type of neuron that includes three gates: the input gate, the forget gate, and the output gate. A diagram of an LSTM cell is shown in figure \ref{fig:lstm}. The input and output gates regulate whether the input or output is allowed to pass into or out of the cell, respectively. The forget gate controls whether the value currently in the memory cell is erased. Each gate processes its designated inputs through a sigmoid function $\sigma (x)$ involving learned weights to produce an output. The general formula for a gate $\Gamma$ is given in equation \ref{eqn:rnn-gates}, where $x_t$ is the input, $h_{t-1}$ is the previous hidden state, $W$ and $U$ are weight matrices, and $b$ is a bias vector.
\begin{equation}
\Gamma = \sigma(Wx_t + Uh_{t-1} + b)
\label{eqn:rnn-gates}
\end{equation}
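Applying equation \ref{eqn:rnn-gates} to each gate with its own weights gives the standard LSTM update, sketched here in the notation of figure \ref{fig:lstm} (the candidate value $\tilde{C}_t$ and the subscripted weights are notation assumed for this sketch, and $\odot$ denotes element-wise multiplication):
\begin{align*}
\Gamma_i &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
\Gamma_f &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
\Gamma_o &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{C}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
C_t &= \Gamma_f \odot C_{t-1} + \Gamma_i \odot \tilde{C}_t, \\
h_t &= \Gamma_o \odot \tanh(C_t).
\end{align*}
The forget gate scales the previous memory, the input gate scales the new candidate, and the output gate determines how much of the memory is exposed as the hidden state.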

A single LSTM neuron is fairly complex, so the GRU is often used instead. A GRU cell is a simplified version of the LSTM cell that merges the cell state and hidden state and combines the input and forget gates into an update gate. A reset gate takes the place of the forget gate, and the two serve similar functions in controlling how much past information affects the output. As such, there is no separately controlled memory cell and the full hidden state is exposed to the subsequent neuron. A diagram of a GRU is shown in figure \ref{fig:gru}. Because they have fewer gates and a simpler structure, GRUs are more computationally efficient than LSTMs. GRUs have also been shown to outperform LSTMs for some tasks. More specifically, GRUs generally learn less prevalent patterns better, while LSTMs learn highly prevalent patterns better~\cite{gruber2020}. LSTMs also tend to perform better when deep understanding and long-term context are needed, due to their memory. As a side note, in many deep learning frameworks the number of ``units'' in a cell refers to the dimensionality of the hidden state and cell state.
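
One widely used formulation of the GRU update, written in the notation of figure \ref{fig:gru} (the candidate state $\tilde{h}_t$ and the subscripted weights are notation assumed for this sketch, and some references swap the roles of $\Gamma_z$ and $1 - \Gamma_z$), is
\begin{align*}
\Gamma_z &= \sigma(W_z x_t + U_z h_{t-1} + b_z), \\
\Gamma_r &= \sigma(W_r x_t + U_r h_{t-1} + b_r), \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (\Gamma_r \odot h_{t-1}) + b_h), \\
h_t &= (1 - \Gamma_z) \odot h_{t-1} + \Gamma_z \odot \tilde{h}_t,
\end{align*}
where $\odot$ denotes element-wise multiplication. In this form a GRU has three sets of weights where an LSTM has four, so for an input of dimension $d$ and $n$ units the parameter counts are roughly $3n(d + n + 1)$ and $4n(d + n + 1)$, respectively. With $d = 6$ and an assumed $n = 32$, that is $3744$ versus $4992$ parameters. Framework implementations may add extra bias terms, so exact counts can differ slightly.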

% How have other people tried to solve it?
\section{Related Work}
\label{related-work}

A multitude of methods for tracking food consumption and ADM have been explored. Researchers have had success with placing sensors on the throat, neck, ears, wrist, and even eyeglasses to monitor energy intake. Amft et al. used an in-ear microphone to analyze chewing sounds of four different foods from four individuals~\cite{amft2005}. This method enabled the authors to accurately classify when the individual was chewing and the type of food they were eating, but only from the four preselected foods. They later modified their approach to use a less invasive microphone placed outside the ear and achieved slightly lower accuracy due to environmental noise. Makeyev et al. proposed the use of a throat microphone, which reduces ambient noise by sensing vibrations on the surface of the skin instead of vibrations in the air~\cite{makeyev2008}. The reported results for swallowing recognition were quite good ($>$95\% accuracy) in a lab environment, but ultimately the method was dismissed due to the inconvenient sensor placement on the throat. Nguyen et al. approached the problem from a slightly different angle and used a recurrent neural network to detect and characterize eating from the number of times a person swallows~\cite{nguyen2017}. Data was collected from 10 subjects in a controlled environment using a wearable necklace with two piezoelectric sensors and an IMU. The long short-term memory (LSTM) network developed for this task achieved a reported 74\% F1 score for swallow detection.

More recently, Gao et al. developed an ADM system to detect eating episodes with the microphones in off-the-shelf Bluetooth headsets~\cite{gao2016}. The authors used a traditional machine learning approach with a support vector machine (SVM) classifier as well as a deep learning approach. In a lab setting (N = 28), both approaches yielded 94--96\% classification accuracy, but accuracy fell in a free-living environment (N = 4). The deep learning approach was more resilient to noise and still achieved 76\% accuracy, but this is nonetheless a dramatic decrease. The researchers noted that ambient noise is the biggest hindrance for free-living acoustic eating detection.

Since acoustic methods for eating detection are limited by background noise, discomfort, and socially awkward positioning of sensors and microphones, the form factor of eyeglasses has also been tested. Farooq and Sazonov designed a device worn on eyeglasses that incorporated a piezoelectric strain sensor positioned over the temporalis muscle and an accelerometer to detect food intake~\cite{farooq2016}. Their approach with SVM classifiers and a decision tree resulted in an average F1 score above 99\%. Amft and Zhang later used a similar form factor equipped with an electromyography (EMG) sensor to detect chewing and reported an F1 score of 95\%~\cite{zhang2018}. Cameras have also been used in some efforts to monitor energy intake. Doulah et al. used this approach with glasses equipped with an accelerometer, a strain sensor over the temporalis muscle, and a wide-angle camera for food image capture (N = 30)~\cite{doulah2020}. The device was designed to take pictures only when the individual was detected to be eating and achieved an eating episode detection accuracy of 83\%.

It has been shown that eating activity can also be detected by monitoring wrist motion with inertial measurement units (IMUs). Most IMUs include gyroscopes and accelerometers and some include magnetometers. When worn on an individual's wrist, these sensors provide information on the orientation and movement of the hand being monitored. Unlike acoustic methods that employ microphones or visual methods that employ cameras, wrist-worn devices do not threaten personal privacy. Moreover, the watch form factor is approachable and many people are already accustomed to wearing watches or fitness trackers. In fact, surveys have shown that a watch form factor is preferred for diet monitoring technology by a sizable margin~\cite{kalantarian2017}.

\begin{figure}
\centering
\includegraphics[width=0.65\textwidth]{img/rolling_wrist_motion.pdf}
\caption{Characteristic rolling motion of the wrist corresponding to a bite, adapted from~\cite{dong2009}.}
\label{fig:bite-motion}
\end{figure}

Dong, Hoover, and Muth found that there is a characteristic rolling motion in the wrist that occurs when taking a bite of food while eating (see figure \ref{fig:bite-motion})~\cite{dong2009}. The authors used this information to develop a rule-based algorithm to detect and count the number of bites taken. They collected wrist motion data from subjects in a controlled environment (N = 10) with an IMU device. The subjects were permitted to eat a meal of their choice with their desired utensils. With their method, the researchers reported a 91\% true positive rate for bites detected, serving as a proof of concept for wrist-motion-based eating detection.

Dong et al. were also the first to develop a method to detect periods of eating in normal, day-to-day life as opposed to a laboratory setting. First, the researchers used a wired wrist-worn IMU device known as an InertiaCube3 connected to a laptop and a battery to track wrist motion (N = 4)~\cite{dong2011}. An activity classification accuracy of 91\% was reported with a rule-based algorithm. A state machine approach for eating detection was also developed with an 82\% true positive rate and 70\% precision. As a result of this work, the authors concluded that wrist motion could be used to segment eating episodes in natural, day-to-day life. 

In later work, Dong et al. used wrist-worn smartphones (Apple iPhone 4) instead of the previous apparatus to record accelerometer and gyroscope data from free-living participants (N = 43, 449 hours, 116 eating events). Their method involved using periods of vigorous wrist motion to bookend periods of eating, which they found typically have less wrist motion. Periods of eating activity were segmented with a naïve Bayesian classifier that yielded a reported 81\% accuracy~\cite{dong2013}.

In recent years, deep learning has made advancements in many fields, including eating detection from wrist motion for natural daily living. Stankoski et al. used a combination of traditional machine learning and deep learning to process smartwatch IMU data from free-living participants (N = 12)~\cite{stankoski2021}. Their research was focused on detecting eating segments rather than ingestion events (i.e. bites). The authors studied the relationship between model performance and cutlery type used for a meal as well as model performance with personalized models. The model performed better when the subject ate with utensils (as opposed to hands) and personalized models offered slight performance advantages on average. Overall, for their eating detection framework they reported a true positive rate of 81\% and precision of 85\%. 
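
The studies above report different metrics, which makes direct comparison difficult. As a rough aid, and assuming that a reported true positive rate can be treated as recall $r$ and paired with the reported precision $p$ (symbols introduced here for illustration), an F1 score can be approximated as the harmonic mean of the two. The values below are our own arithmetic on the reported figures and were not reported by the original authors.
\[
F_1 = \frac{2pr}{p + r}
\]
\[
F_1^{\mathrm{Dong}} \approx \frac{2(0.70)(0.82)}{0.70 + 0.82} \approx 0.76, \qquad
F_1^{\mathrm{Stankoski}} \approx \frac{2(0.85)(0.81)}{0.85 + 0.81} \approx 0.83
\]
These estimates are indicative only, since the datasets, units of evaluation, and definitions of a detection differ between studies.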

Luktuke used a deep learning classifier, more specifically a CNN with residual connections, to segment and categorize various eating gestures from wrist motion (N = 276)~\cite{luktuke2020}. The model architecture, which resembled that of U-Net~\cite{unet}, achieved correct classification of 79.6\% of `bite' and 80.7\% of `drink' gestures. Of all of the gestures in the publicly available Clemson Cafeteria Dataset that was used, 77.7\% were correctly classified.

Kyritsis et al. proposed a bottom-up approach to automatically detect food consumption by amalgamating bites into meals~\cite{kyritsis2017, kyritsis2019}. Using data from an off-the-shelf smartwatch (N = 12), the authors achieved a 79\% weighted accuracy for eating episode detection with a neural network involving convolutional and recurrent layers~\cite{kyritsis2020}. Yet, the authors' latest approach was focused on detecting meals where a fork and/or spoon was the eating utensil of choice. This results in limitations based on the type of cutlery the individual uses (e.g. bare hands, chopsticks) and unpredictability if the user drinks outside of a meal. We believe these to be limitations of a bottom-up approach in general.

Sharma also used deep learning, but with a top-down approach to detecting eating on a much larger dataset~\cite{sharma2020}. Instead of analyzing eating episodes by grouping bites, eating episodes were segmented and classified based on overall wrist motion throughout the day, i.e. eating detection instead of bite detection. The Clemson All-Day (CAD) Dataset consisting of 4,680 hours of wrist motion data and 1,063 eating events collected from 351 participants was used for this research~\cite{cad2020}. To our knowledge this is still the largest publicly available dataset of all-day wrist motion data. A sliding window approach, CNN, and hysteresis-based detector were used to detect and segment eating episodes from the wrist motion data. The results of this work were 89\% of all meals detected, 1.7 false detections for every true meal detected, and a time-weighted accuracy of 80\%. This thesis builds on the work of Sharma, so further information is included in chapter \ref{methods}.
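
The reported ratio of false detections to true detections can be restated as an episode-level precision. This is a sketch under the assumption that the ratio counts false positives ($\mathit{FP}$) per true positive ($\mathit{TP}$) among detected episodes; the symbols are introduced here for illustration and the resulting value was not reported in the original work.
\[
\mathrm{precision} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}} = \frac{1}{1 + \mathit{FP}/\mathit{TP}} \approx \frac{1}{1 + 1.7} \approx 0.37
\]
Under this reading, only a little over one in three detected episodes corresponds to an actual meal, which illustrates why reducing false detections is worthwhile.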
 
Wei investigated training the model developed by Sharma~\cite{sharma2020} for individual participants~\cite{wei2021}. The goal was to capture specialized individual-specific eating patterns to improve overall eating episode detection. To do so, a new dataset was collected with at least 10 days of data from 8 different subjects. With individualized models, an increase in average weighted accuracy was reported (82\%), but the extent of the increase varied subject by subject.
\newpage
\section{Daily Context}
\label{motivation}

Previous work from our research group produced a CNN classifier used to process wrist motion data and output a continuous probability of eating~\cite{sharma2020}. This model is referred to as the ``windowed eating classifier'' or ``window-based eating classifier''. The wrist motion data was from the publicly available Clemson All-Day (CAD) dataset consisting of 4,680 hours of wrist motion data and 1,063 eating events collected from 351 participants~\cite{cad2020}. The probability of eating, or $P$($E$), output by the window-based classification model ranges from 0 to 1 throughout the day and reflects how likely an individual is to be eating given their wrist movement. A higher $P$($E$) (closer to 1) corresponds to a higher likelihood of eating at that time. Figure \ref{fig:daily_PE} shows what the daily $P$($E$) looks like for one individual in the dataset. The goal of this work is to use the daily $P$($E$) data in a recurrent neural network model to achieve better overall eating episode detection and substantially reduce the number of false detections. This model is referred to as the ``daily pattern classifier''. The three approaches differ mainly in the time scale they examine. With their bottom-up approach, Kyritsis et al. used a base approximation window of 3.6 seconds~\cite{kyritsis2019}. The top-down approach Sharma used for the windowed eating classifier analyzed a sliding window of 6 minutes~\cite{sharma2020}. This work looks at an even wider window spanning an entire day (24 hours).
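
To make the difference in time scale concrete, the three window lengths can be compared directly. The symbols $T_{\mathrm{bite}}$, $T_{\mathrm{window}}$, and $T_{\mathrm{day}}$ are introduced here only for illustration.
\[
T_{\mathrm{bite}} = 3.6\;\mathrm{s}, \qquad
T_{\mathrm{window}} = 6\;\mathrm{min} = 360\;\mathrm{s}, \qquad
T_{\mathrm{day}} = 24\;\mathrm{h} = 86{,}400\;\mathrm{s}
\]
\[
\frac{T_{\mathrm{window}}}{T_{\mathrm{bite}}} = 100, \qquad
\frac{T_{\mathrm{day}}}{T_{\mathrm{window}}} = 240, \qquad
\frac{T_{\mathrm{day}}}{T_{\mathrm{bite}}} = 24{,}000
\]
Each step up therefore widens the span of context by roughly two orders of magnitude.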

\begin{figure}
\centering
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/daily_PE_P2193_blank.pdf}
\caption{Probability of eating  $P$($E$) throughout an entire day showing 3 strong peaks, all of which are actual meals, and low background noise.}
\label{fig:daily_PE}
\end{figure}

Within the $P$($E$) data from an entire day, we see various periods of high $P$($E$) indicating a high likelihood of eating. We call these ``peaks''. These may correspond to an actual eating event or merely to wrist movement that closely resembles eating motions. The types of individual peaks we tend to see in daily $P$($E$) sequences are shown in figure \ref{fig:peaks}. The first type of peak (a) is flat, solid, and rectangular at values very close to 1.0. This variety of peak almost always corresponds with an actual eating event. The second category of peaks (b) is not as well defined and shows more dispersed $P$($E$), likely caused by secondary activities during eating like watching TV, using a smartphone, or talking with friends. Peak type (c) is a bifurcated peak that suggests the individual returned for seconds during a meal or rested between courses. Peak types (d) and (e) demonstrate the varying length of eating episodes. Finally, peak type (f) shows a false detection of eating that resembles an authentic eating event. Even within these categories there is variability due to the daily routines, eating rate, or eating technique of different individuals. This suggests that a template matching approach would not be very useful. Thus, a neural network approach was chosen.

\begin{figure}
\centering
\includegraphics[width=\textwidth, trim=0 0.2cm 0 0.1cm, clip]{img/window_PE.eps}
\caption{Types of individual peaks seen in daily $P$($E$) data: (a) obvious meal (b) fluctuating response (c) bifurcated (d) short meal (e) long meal (f) false detection resembling meal.}
\label{fig:peaks}
\end{figure}

The motivating idea for this work is that the $P$($E$) from an entire day provides valuable insight about where eating occurs. On their own, individual peaks do not provide enough information to conclude whether they correspond to actual eating, yet a single peak is roughly the amount of data analyzed by a windowed or convolutional approach. In short, this myopic approach is limited. We hypothesize that extended daily context could help improve classification of eating events by reducing the number of false detections. Moreover, a neural network model architecture designed to learn these temporal relationships and features could use them to accurately predict eating activity from the entire day in a post-hoc manner.

Figure \ref{fig:daily_PE} shows that eating typically occurs at isolated periods of high $P$($E$). However, in this example the eating activity is easy to distinguish since there is low background noise in the $P$($E$) signal. As mentioned earlier, there is also the possibility of false detections caused by gestures that resemble eating. A false detection or ``false positive'' is a period of high $P$($E$) during which no actual eating event occurred. For example, grooming activities that involve moving the eating hand to the face, like fixing or brushing hair, adjusting glasses, or touching the face, can cause false detections. This can also include morning and bedtime routines where an individual may be brushing teeth, styling hair, shaving, or applying makeup.

We have observed a few contextual clues and patterns that help reduce the number of false detections when manually reviewing $P$($E$) sequences. We denote a period of time related to a pattern as an ``event of interest''. As a note, in this work meals that occur before 12:00 are referred to as ``breakfast'', meals that occur between 12:00 and 16:00 are ``lunch'', those that happen after 16:00 are denoted as ``dinner'', and all small meals interspersed throughout the day are ``snacks''.

\begin{figure}
\centering
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0 0.1cm 0, clip]{img/daily_PE_P2356_gt.pdf}
\caption{Daily sequence showing 6 strong peaks, of which 3 are actual meals and 3 are nearby transient responses. The transient responses are probably caused by a morning routine (A), food preparation (B), and cleanup (C), the latter two of which may include light snacking. }
\label{fig:example1}
\end{subfigure}
\\[24pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/daily_PE_P2483_gt.pdf}
\caption{Daily sequence showing 5 strong peaks, of which 3 are actual meals. The peak at A may be caused by snacking during food preparation. The peak at B is less likely to be an eating episode because of its proximity to two other peaks (eating episodes tend to be spaced multiple hours apart).}
\label{fig:example2}
\end{subfigure}
\\[24pt]
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/daily_PE_P2236_gt.pdf}
\caption{Daily sequence showing 5 strong peaks, of which 3 are actual meals, and higher than typical background noise. Peak A is probably caused by a morning routine, peak B is too proximal to other peaks, and the large rest area (C) provides context that the peaks prior to C and subsequent to C are more likely to be actual eating.}
\label{fig:example3}
\end{subfigure}
\caption{Daily $P$($E$) sequences with actual eating episodes (green bars) and other events of interest (orange shaded bars) highlighted.}
\end{figure}

First, eating episodes are normally spaced at regular intervals of 4--6 hours. Humans are less likely to eat several full-size meals in a short span of time. The daily meal schedule followed by most people adheres to this pattern as shown in figures \ref{fig:example1}--\ref{fig:example3}. In figure \ref{fig:example2}, the regularity of meal spacing would even help ignore the false detection around 16:30 (B). Similarly, the $P$($E$) sample shown in figure \ref{fig:example3} has many potential false detections early in the day (primarily B). Yet, these can be ignored by backtracking through the day and using the regularity of the strong meals as clues.
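
The spacing pattern can be stated more formally. The following is only a sketch in notation introduced here for illustration: let $t_1 < t_2 < \dots < t_n$ denote the start times of the meals in a day and let $\Delta_i$ denote the gap between consecutive meals.
\[
\Delta_i = t_{i+1} - t_i, \qquad 4\;\mathrm{h} \le \Delta_i \le 6\;\mathrm{h}, \qquad i = 1, \dots, n-1
\]
For purely hypothetical meal times of 08:00, 13:00, and 18:30 (not taken from the dataset), the gaps are $\Delta_1 = 5$~h and $\Delta_2 = 5.5$~h, both within the typical range. A candidate peak that would create a gap far shorter than this range is a natural suspect for a false detection.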

Second, an individual may snack on or taste food while preparing a meal or cooking, but this is not a true meal. This instance is usually indicated by two distinct peaks in close succession where the second peak is actually the meal. For example, the sequence shown in figure \ref{fig:example1} demonstrates a case where the individual was possibly snacking while preparing dinner since there is a short interval of elevated $P$($E$) close to 18:00 (B). A similar pattern is seen after dinner (C) that may indicate a quick dessert or even seconds while cleaning up that did not constitute a full meal. Figure \ref{fig:example2} shows a daily sequence with a very long, bifurcated peak around 12:55 (A) that indicates possible light snacking before lunch. In the report for this day the individual noted that lunch was at a restaurant with friends, so the high $P$($E$) may correspond to eating an appetizer or animated conversation.

Third, a morning routine generally occurs early in the day, so high $P$($E$) before the first real meal of the day (usually breakfast) could indicate a false detection. To illustrate this, figure \ref{fig:example1} shows a daily $P$($E$) sequence that exhibits noticeable evidence of a morning routine with a $P$($E$) peak before breakfast around 8:00 (A). The beginning of the sequence in figure \ref{fig:example3} likely demonstrates an extended morning routine from 6:45 to 7:15 (A) as well. This anomaly may even be present if the individual skipped breakfast. It would appear equally early in the recording, but without a meal following it.

Lastly, periods of rest tend to occur before or after eating a meal. For instance, a person may take a midday nap or siesta in the period between lunch and dinner. In figure \ref{fig:example3}, a pattern indicating this behavior can be seen, as the two main meals occur immediately before and after a large period of rest in the afternoon (from 15:00 to 20:00 [C]). Overall, these important contextual indicators and patterns can be seen throughout the dataset. The ones presented here are merely the most common, perceptible ones we detected in our analysis. A neural network can be expected to extract less recognizable, latent features as well, hence our approach.

% How do you propose to solve it?
\section{Novelty}
\label{novelty}

The novelty of this work is applying neural networks to analyze an entire day of data and segment episodes of eating activity. Past work has explored windowed approaches to this problem with CNNs operating on accelerometer and gyroscope data from IMU devices. Using the output of such a model, we are able to analyze daily context for eating patterns in efforts to improve eating episode detection and reduce false detections. Furthermore, to our knowledge, very few works have investigated eating detection for free-living subjects with a dataset of this scale (N = 351). Overall, this work strives to answer the following questions: 
\begin{enumerate}
\item Does analyzing the probability of eating in a daily context with a neural network improve eating episode classification?
\item Can this approach reduce the number of false detections in eating episode detection?
\item How do the results of this approach compare to those from a window-based classifier?
\end{enumerate}