\chapter{Methods}
\label{methods}

An overview of the methods for this work is shown in the flowchart in figure \ref{fig:method_overview}. The methods of this study begin with the window-based eating classifier from previous work~\cite{sharma2020}. This model is used to process wrist motion data and produce continuous $P$($E$) sequences from a day-length recordings, also referred to as ``daily samples''. For data augmentation purposes, the window-based eating classifier is used repeatedly to generate a large number of daily samples. The daily samples are then saved with the corresponding ground truth, pre-processed by padding, and prepared for the daily pattern classifier. This data is used to train and test the daily pattern classifier with the paradigm of 5-fold cross validation. The output of the model is post-processed with a thresholding algorithm to generate a sequence of predictions from the model. Lastly, various evaluation metrics are measured. As a point of confusion, both the daily pattern classifier and the windowed eating classifier produce a probability of eating as their output. For greater clarity, $P$($E_w$) corresponds to the probability of eating from the windowed eating classifier and $P$($E_d$) refers to the probability of eating output by the daily pattern classifier for the remainder of this work.

\begin{figure}[h!]
\centering
\includegraphics[width=0.75\textwidth]{img/method_overview.pdf}
\caption{Overview of the methods for this work.}
\label{fig:method_overview}
\end{figure}

\section{Data Augmentation}
\subsection{Window-based Eating Classifier}

\begin{figure}[b!]
\centering
\includegraphics[width=0.55\textwidth]{img/surya_nn.png}
\caption{Overview of the windowed eating classifier method involving a sliding window of length $W$ minutes and stride $s$ seconds used to generate a continuous probability of eating for an entire day, adapted from~\cite{sharma2020}.}
\label{fig:suryaprocess}
\end{figure}

To generate daily samples, first the windowed eating classifier was trained. Using a trained model, a sequence of model predictions were generated about the activity of an individual (eating/non-eating) that we call the $P$($E_w$). A brief recapitulation of the procedure for training this model is provided here. First, $z$-score normalized 15 Hz wrist motion data was separated into windows of length $W$ minutes with $s$ second stride between them. These ``window samples'' were used to train the windowed eating classifier. The ground truth used during training was based on a ``majority vote'' of the window. If more than half of a window contained eating, the window was marked as eating in the ground truth. Likewise for non-eating windows. Training was balanced with the same number of eating and non-eating windows to prevent classifier bias. After training, a $P$($E_w$) value ranging from 0 to 1 was output by the classifier indicating the likelihood that the provided input window sample contained eating. The $P$($E_w$) output was generated using model prediction, also known as model inference, where the output is predicted from an input using a trained deep learning model. By concatenating the outputs from adjacent windows, a continuous $P$($E_w$) sequence was generated for an entire day. This process is depicted in figure \ref{fig:suryaprocess}.

\subsection{Day-Level Data Augmentation}
\label{gen-data}

\begin{figure}[b!]
\centering
\includegraphics[width=\textwidth]{img/data_flow.pdf}
\caption{Flowchart of the data augmentation process to generate daily sample data for our daily pattern classifier.}
\label{fig:dataflow}
\end{figure}

The CAD dataset~\cite{cad2020} only contained 354 recordings of an entire day. We required a much larger set of daily samples in order to be able to train and test a day-level classifier.  Therefore, we used the CAD dataset to generate 565 different samples from each actual daily recording, yielding approximately 200,000 total daily samples. The data augmentation process is outlined in figure \ref{fig:dataflow}. To generate this day-level data, the windowed eating classifier was first trained on data from all 354 recordings. The standard training and prediction process described in the previous section was used to generate day-length $P$($E_w$) sequences for each subject in the dataset. An additional step was added to save the full daily samples to a file with the corresponding ground truth (as shown in figure \ref{fig:dailysample_file}). The ground truth was defined as the raw label (eating or non-eating) for the point at the center of the window in the raw data.  It is important to note that this varies from the ground truth labeling used for \textit{training} the windowed eating classifier. However, this does emulate the ground truth labeling for \textit{testing} the windowed eating classifier. The convention for labeling the ground truth was a value of 1 to indicate eating and a value of 0 to indicate non-eating. An example of a single daily sample is shown figure \ref{fig:dailysample}. 

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{img/daily_sample_file.pdf}
\caption{Excerpt from a file containing a daily sample with $P$($E$) values on the top row and corresponding ground truth values on the bottom row.}
\label{fig:dailysample_file}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/example_daily_sample.pdf}
\caption{Plot of a daily sample showing the probability of eating throughout an entire day.}
\label{fig:dailysample}
\end{figure}

This three-part train, predict, and save procedure was repeated multiple times to achieve the desired quantity of generated samples. A total of 200,010 daily samples were generated to accommodate early estimates of the model size and complexity. For training, a window length $W$ of 6 minutes (5400 data) and a stride length $s$ of 15 seconds (225 data) were used as in previous work~\cite{sharma2020}. Each model was trained for 30 epochs and the model with the best training accuracy was saved. Training code from Sharma was used with minor modifications (e.g. multiple GPU optimization, repeated iterations)~\cite{sharma2020}.

The potential drawback of this approach is that daily samples generated repeatedly from the same daily recording may not have much variability and consequently result in overfitting in our day-level classifier. To explore this issue, we visualized the variance in daily samples generated from the a single actual day-length recording. Figure \ref{fig:model-volatility-ex} shows an example of the spread between the minimum and maximum values across all of the data generated from a single recording. Similar variance was observed across all actual daily recordings in the original dataset. Additionally, during training and testing of the daily pattern classifier, we do not split the daily samples generated from the same actual day recording across the train/test boundary. That is, daily samples were split by original recording identifier and not by arbitrary indices for training and testing splits. As found in earlier experiments, the volatility of the windowed eating classifier produced substantial genuine noise to precipitate this phenomenon. The average standard deviation of evaluation metrics measured per subject varied 5-10\% in subsequent model re-training runs. The full extent of this volatility is quantified and explored in appendix \ref{app:model-vol}.

\begin{figure}
\centering
\includegraphics[width=\textwidth, trim= 0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/P2452_volatility.pdf}
\caption{Difference (shading) between the minimum (lower dashed line) and maximum (upper solid line) values across all 565 samples generated from a single recording in the CAD dataset.}
\label{fig:model-volatility-ex}
\end{figure}

\subsection{Downsampling}
\label{downsampling}

A challenge in analyzing a day-length recording is that there is a large amount of data in a day. A classifier that considers all this data simultaneously would thus have a very large number of parameters and suffer from long training and inference times. Additionally, the time needed to generate the numerous 200,000+ samples had to be reasonable. In order to reduce this complexity, we downsample the day-length recording from 0.067 Hz (1 datum every 15 seconds) to 0.01 Hz (1 datum every 100 seconds or 1.67 minutes) during model inference. We seek to retain the overall daily pattern of $P$($E$) without modeling the small fluctuations that happen second to second.

\begin{figure}
\centering
\includegraphics[width=\textwidth, trim=0.1cm 0.2cm 0.1cm 0.2cm, clip]{img/PE_resolution_comparison.pdf}
\caption{Effect of stride length $s$ on resolution of a daily sample: (a) $s$ = 1 datum ($\frac{1}{15}$ second), (b) $s$ = 15 data (1 second), (c) $s$ = 150 data (10 seconds), (d) $s$ = 1500 data (100 seconds), (e) $s$ = 15000 data (1000 seconds).}
\label{fig:sampleresolution}
\end{figure}

We chose the downsampling factor using visual estimation.  Figure \ref{fig:sampleresolution} shows an example. In this figure daily samples are compared with different stride lengths. Each daily sample from (a) to (e) increases the stride length by an order of magnitude. Smaller increases did not substantially impact any of the prioritized objectives. Daily sample (a) was generated with a 1 datum stride, (b) with a 15 data stride (for a perfect 1 second), (c) with a 150 data stride, and so on. Daily samples (a) - (c) appear largely unchanged at the daily level. A stride length of 1500 data (d) is the first order of magnitude jump to show some downsampling. The next order of magnitude increase used to produce daily sample (e) is noticeably losing data. Thus, after scrutinizing this visual comparison of daily sample fidelity, a stride length of 1500 data was chosen because it offered downsampling without loss of the overall integrity of the signal. In summary, a 15 second (225 data) stride length was used for training and a 100 second (1500 data) stride length was used for prediction. 

The nature of the windowed eating classifier allowed different stride lengths to be used for training and testing (or prediction). As the name implies, this ability was enabled by the fact that it worked on sliding windows of data. As long as the window length remained constant, changing the stride length only changed the number of windows that would be processed by the model and the resolution of the resulting $P$($E$) signal. This functionality was exploited in the original work to improve data granularity for testing (down to a 1 datum stride)~\cite{sharma2020}. 

\section{Pre-Processing}

The characteristics of $P$($E$) data from the windowed eating classifier already confined it to a range of [0, 1]. Consequently, no feature scaling or normalization was used on the daily sample data; raw daily samples were used as the input to the daily pattern classifier. Uniform and Gaussian smoothing were tested, but both degraded performance so they were not used. Since each daily sample was a different length, all samples were padded to the maximum length (850 data = 23.6 hours = $850 \times 100 / (60 \times 60)$) with -1 values before processing. The value -1 was chosen because it was outside the range of the probability of eating. These values were masked out later in the model as described in section \ref{architecture}. The data was also reshaped to a 3D tensor with the shape [\verb|batch|, \verb|timesteps|, \verb|features|] that is required by RNNs in TensorFlow before processing. For this data the shape was [-1, 850, 1], where the -1 value allowed TensorFlow to determine the number of batches at runtime. 

\section{Daily Pattern Classifier}

A recurrent neural network design was used for the daily pattern classifier due to the strength of RNNs with variable-length time series data. The memory aspect of RNNs was also imperative to capture important indicators from the daily context. Other architectures were tested including a fully-connected network and 1D convolutional neural network with residual connections (based on U-Net~\cite{unet}). Both approaches had drawbacks. The fully-connected ``dense'' network was limited by the nature of its architecture. Dense architectures are designed to learn the relationship between a fixed-size input and a fixed-size output with both as aggregate units. We needed an architecture that could operate over a sequence datum-by-datum and learn to classify slices of that sequence. The 1D CNN architecture was also restrained as it fell back to a similar myopic view of the daily sequences. Appropriately, it demonstrated poor performance and limitation in learning attributes of the daily patterns. Furthermore, since our data had inconsistent lengths, it would need to be truncated, padded, or divided into fixed-length sliding windows for processing with these networks. Truncating would have removed useful data, padding would have added unnecessary data, and a sliding window approach is antithetical to the goal of capturing daily context. Therefore, an RNN was built for the task.

\subsection{Architecture}
\label{architecture}
% Model architecture

The architecture of the daily pattern classifier includes a masking input layer, a single bidirectional GRU layer, and a time-distributed dense output layer. The bidirectional GRU layer uses tanh activation functions and the output dense layer uses sigmoid activations. A diagram of the model architecture is shown in figure \ref{fig:architecture}. Overall, the model includes 1,841 trainable parameters. The classifier was designed, trained, and evaluated using TensorFlow 2.2.0 and the Keras API in Python 3.8.3. 

\begin{figure}
\centering
\includegraphics[width=0.6\textwidth, trim=0.1cm 0.1cm 0.1cm 0.1cm, clip]{img/daily_pattern_model_architecture_final.png}
\caption{Recurrent neural network architecture for the daily pattern classifier (`?' is a placeholder for the number of batches).}
\label{fig:architecture}
\end{figure}

\subsubsection{Masking Layer}
As described previously, the input was padded to with -1 values to a length of 850 data. The first layer of the network, a masking layer, was used to skip these invalid timesteps. That way, the computation of gradients and weights within the network was only dependent on real data and not padded zeros or other fabricated values.

\subsubsection{Bidirectional GRU Layer}
The next layer in the model architecture was a gated recurrent unit (GRU) layer enclosed in a bidirectional wrapper. The default GRU in Keras was used with the reset gate applied to the hidden state before matrix multiplication. A bidirectional wrapper was added around the GRU layer to compute  weights in a forward pass of the input first and then a flipped backward pass. The summation of the outputs from the forward and backward pass was computed instead of the default concatenation as this was found to offer slightly better performance. This was done by setting the \verb|merge_mode| parameter to \verb|sum| in Keras. Xavier normal initialization was used (\verb|glorot_normal| kernel initializer in TensorFlow) for the weights of this layer because it yields better initial values that are related to the structure of the layer. The default activation functions were used. The hyperbolic tangent (tanh) function was used as the activation function for the layer and the sigmoid function was used as the recurrent activation function between recurrent steps within the layer. For the GRU layer, $U$ units were used. Different values of $U$ were tested including 8, 16, 32, 64, 128, and 256 (i.e. powers of 2 from $2^3$ to $2^8$). Both GRU and LSTM layers were also tested. The full results of this grid search for the best number and type of units is shown in section \ref{mem_units}. 

\subsubsection{Time-distributed Dense Layer}
Lastly, a single-unit dense layer was encapsulated in a time-distributed wrapper for the model output. The TimeDistributed wrapper in Keras applies the enclosed layer to each timestep of the input. This makes it possible to produce an output for each data point in the input when combined with a 1-unit dense layer. Hence the input and output dimensions of the overall model are the same: 850 data. A sigmoid activation function was used on the output of the dense layer to limit the output to the range [0,1].

\subsection{Training}

\begin{figure}
\centering
\includegraphics[width=\textwidth]{img/kfold_cross_validation.pdf}
\caption{Overview of $k$-fold cross validation, $k=5$.}
\label{fig:kfold}
\end{figure}

% Training/evaluation
Training was performed using $k$-fold cross validation. In general, $k$-fold cross validation is a technique to evaluate the performance of a neural network on all of the data in a dataset. The data is split into $k$ different groups or ``folds'' and one fold is reserved for testing while the other folds are combined for training. The value for $k$ is usually set to 3, 5, or 10; we will use $k = 5$. Figure \ref{fig:kfold} depicts 5-fold cross validation as it is used in this work. The folds were split by subject from the original CAD dataset~\cite{cad2020}, rather than by total data. This was done so there was no train/test overlap between samples that originated from the same recording. Each split included training data from 80\% of the recordings ($\approx$271 recordings) and testing data from 20\% of the recordings ($\approx$71 recordings). Accordingly, of the 200,000 available daily samples, each split involved training on $\approx$160,000 samples and testing on $\approx$40,000 samples. Model training was performed on the Clemson University Palmetto Cluster, a high-performance computing (HPC) system, on a compute node equipped with 40 cores, 370 GB of RAM, and 2 NVIDIA Tesla V100 graphics cards with a combined 32 GB VRAM.

\subsection{Network Hyperparameters}

Hyperparameters of a neural network are the adjustable controls outside the layers of the neural network architecture. There are several that need to be set to reliably and repeatably train a neural network like the number of training epochs and the batch size. The necessary hyperparameters for the daily pattern classifier considered in this work are defined here:

\begin{enumerate}[label=\textbf{\arabic*}.]
	\item {\bfseries Loss Function:} Binary cross entropy loss was used because the model was performing the binary classification task of determining whether each timestep was eating (1) or non-eating (0). The binary cross-entropy loss function is defined as follows, where $N$ is the number of training samples, and $y_i \in \{0, 1\}$ is the target class:

\begin{equation}
\mathcal{L}_{\text{BCE}} = \frac{1}{N} \sum^N_{i=0} y_i \cdot \text{log}(\bold{P}(y_i)) + (1 - y_i) \cdot \text{log}(1-\bold{P}(y_i))
\end{equation}
	\item {\bfseries Number of Training Epochs:} The daily pattern classifier was trained for 50 epochs and the model with the best training accuracy was saved. Training was tested up to 100 epochs, but upon inspecting the training accuracy and cross entropy loss on the training data over this range, we only noticed marginal improvement after 50 epochs. Furthermore, training for longer than 50 epochs did not result in any performance advantage on evaluation metrics. 
	\item {\bfseries Optimizer:} The Adaptive Moment Estimation (Adam) optimizer was utilized for training. It builds on classical stochastic gradient descent by using adaptive learning rates for different parameters in the network. The Adam method incorporates elements from the RMSprop and Adadelta optimizers. As a result, it is quite popular and one of the best overall optimizers~\cite{ruder2016}. All parameters for the Adam optimizer were left set to the default values in TensorFlow.
	\item {\bfseries Learning Rate:} Closely related to the optimizer is the learning rate. The initial learning rate for the Adam optimizer was also left at the standard value of 0.001. Training with this value was predominantly stable. Training accuracy and loss were not erratic or slow to converge, so the learning rate was left consistent with TensorFlow defaults.
	\item {\bfseries Batch Size:} A batch size of 64 was used to balance training time per epoch and training accuracy. The batch size is commonly selected as a power of 2, seeing that this has been found to improve training time by best utilizing parallel processing on GPUs. The value of 64 was the next step up from the default 32. We decided to increase the batch size by one notch to speed up training but avoid changing it too much and disrupting the results.
\end{enumerate}

\section{Post Processing}

\begin{figure}[b!]
\centering
\includegraphics[width=0.7\textwidth]{img/thresholding_algorithm.pdf}
\caption{$P$($E_d$) signal before and after being processed with the single-value thresholding algorithm ($T = 0.25$).}
\label{fig:T_algo}
\end{figure}

For post-processing, each sequence of $P$($E_d$) probabilities output by the daily pattern classifier was passed through a thresholding algorithm. A single threshold $T$ was used for the algorithm, so if a datum in the sequence was above $T$, the datum was classified as eating (labeled 1). Likewise, if a datum was below $T$, it was classified as non-eating (labeled 0). This is shown mathematically in equation \ref{eqn:thresholding}, where $f(x)$ is the thresholding function and $x$ is the input  $P$($E_d$) value.
\begin{equation}
f(x) = \begin{cases}
1\qquad \text{if } x > T\\
0\qquad \text{otherwise}
\end{cases}
\label{eqn:thresholding}
\end{equation}

Different values of $T$ were tested. Other heuristics were also employed including merging detections within a half window length of each other and ignoring detections less than 1 minute entirely. The result was a binary sequence indicating the model predictions for where eating occurred. An example of a $P$($E_d$) sample processed by this thresholding algorithm is shown in figure \ref{fig:T_algo}. This prediction was then compared with the ground truth sequence to calculate the metrics described in section \ref{metrics}.

For the window-based eating model a dual-threshold hysteresis approach was used for post-processing instead. In this approach the $P$($E$) signal would have to exceed a starting threshold $T_S$ and fall below an ending threshold $T_E$ to count as a detection. As this was used in previous work~\cite{sharma2020}, this was the default for early testing. However, there was less noise in the $P$($E_d$) signal than the $P$($E_w$) signal, so we decided to use a single threshold algorithm instead.

\section{Evaluation}
\label{metrics}

\begin{table}
\centering
\renewcommand\arraystretch{1.5}
\setlength\tabcolsep{0pt}
\begin{tabular}{c >{\bfseries \hspace{1em}}r @{\hspace{0.7em}}c @{\hspace{0.4em}}c @{\hspace{0.01em}}c}
  \multirow{10}{*}{\rotatebox{90}{\parbox{8cm}{\bfseries\centering Actual Class}}}
   & & \multicolumn{2}{c}{\bfseries Predicted Class} & \\
  & & \bfseries Eating & \bfseries Non-Eating & \bfseries Total \\
  & {\rotatebox[origin=bl]{90}{\bfseries Eating}} & \CMBox{True}{Positive}{(TP)} & \CMBox{False}{Negative}{(FN)} & \TotalBox{Actual}{Condition}{Positive}{(P)} \\[2.4em]
  & {\rotatebox[origin=tc]{90}{\bfseries\phantom{Ml}Non-Eating}}& \CMBox{False}{Positive}{(FP)} & \CMBox{True}{Negative}{(TN)} & \TotalBox{Actual}{Condition}{Negative}{(N)} \\
\end{tabular}
\caption{Eating classifier confusion matrix}
\label{tab:confusion_matrix}
\end{table}

There are two types of evaluation metrics associated with an eating detection classification model: time and episode metrics~\cite{eatingmetrics}. For this work, both sets of metrics were calculated from the binary prediction sequences after post-processing had occurred. First, time metrics were calculated by comparing the prediction output to the ground truth (GT) for each timestep. As mentioned earlier, the GT was constructed using the raw label value at the center of each window that produced a $P$($E_w$) data point. Essentially, this downsampled the GT to the stride length of the data for evaluation so it was consistent with the output of the model. All time metrics computed were based on the four categories of detection in a standard confusion matrix. Any eating classifier would have the same four values. A true positive (TP) occurred if the classifier predicted that a timestep was eating and it was also eating in the ground truth. A true negative (TN) was recorded if the classifier predicted non-eating for a timestep that was non-eating in the GT. A false positive (FP) was marked if the classifier predicted eating when the data point was actually non-eating. And, a false negative (FN) was logged when the classifier missed an eating timestep and instead marked it as non-eating. The confusion matrix for these four detection counts is shown in table \ref{tab:confusion_matrix}. Two additional quantities can also be computed: actual condition positive count ($\text{P} = \text{TP} + \text{FN}$) and actual condition negative count ($\text{N} = \text{TN} + \text{FP}$). The labeling convention for these types of detections is shown in figure \ref{fig:time_detections}.

\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{img/time_metrics.pdf}
\caption{Labeling of eating time metrics between ground truth meal and model detection: true positive (TP), true negative (TN), false positive (FP), and false negative (FN)}
\label{fig:time_detections}
\end{figure}
\begin{figure}[b!]
\centering
\includegraphics[width=0.95\textwidth]{img/episode_metrics.pdf}
\caption{Labeling of eating episode metrics between ground truth meals and model detections: true positive (TP), false positive (FP), and false negative (FN)}
\label{fig:episode_detections}
\end{figure}

Second, episode metrics were calculated by comparing the overlap of model predicted eating episodes and GT eating episodes. In this context, an eating episode is defined as a continuous interval of time classified as eating (i.e. a series of ones). If there was any overlap between the model prediction and the ground truth, the episode was counted as a true positive (TP). A false positive (FP) eating episode occurred when the model predicted an eating episode when there was not one in the GT data. And, a false negative (FN) was constituted by the classifier missing a ground truth event. There is no concept of a true negative (TN) eating episode for episode metrics with this data. The different types of episode detections are shown in figure \ref{fig:episode_detections}.

The metrics for both categories were calculated from these detection categories. We considered the following time metrics: weighted accuracy (Acc$_W$), true positive rate (TPR), true negative rate (TNR), F$_1$ score, and precision (equations \ref{eqn:wacc} - \ref{eqn:prec} on page \pageref{eqn:wacc}). Weighted accuracy was considered instead of raw accuracy because of the class imbalance in the data. Eating was far less common than non-eating. For weighted accuracy, $W$ is the weighted balance ratio of non-eating to eating calculated prior to this metric by fold. TPR is also known as sensitivity or recall and TNR is also known as specificity. The selected terms were chosen for clarity.

\begin{equation}
\label{eqn:wacc}
\text{Acc}_W = \dfrac{W \cdot \text{TP} + \text{TN}}{W \cdot \text{P} + \text{N}}
\end{equation}
\begin{equation}
\text{TPR} = \dfrac{\text{TP}}{\text{TP} + \text{FN}}
\end{equation}
\begin{equation}
\text{TNR} = \dfrac{\text{TN}}{\text{TN} + \text{FP}}
\end{equation}
\begin{equation}
F_1 = \dfrac{\text{TP}}{\text{TP} + 1/2 \cdot (\text{FP} + \text{FN})}
\end{equation}
\begin{equation}
\label{eqn:prec}
\text{Precision} = \dfrac{\text{TP}}{\text{TP} + \text{FP}}
\end{equation}

For episode metrics, we only considered TPR and F$_1$ score, each calculated in the same manner but with episode instead of time detection counts. We also evaluated the number of false positives per true positive (FP/TP) as an indicator of how much the classifier was triggering for other events throughout the day.

Evaluation metrics were calculated after summing up all of the time TPs, TNs, FPs, and FNs and the episode TPs, FPs, and FNs accordingly, not after processing each recording. Furthermore, evaluation metrics were only measured for data within the original recording length, not the entire padded sequence.

\section{Runtime Considerations}

\subsection{Data Loading}
A problem of this scale required specific implementation details to perform well without requiring an exorbitant amount of time. First, to import the 200,000 daily samples from individual text files, the files were read in parallel using the \verb|multiprocessing| module in Python to decrease load times. However, the Python scripts for the daily pattern classifier needed to be run many times during development and testing, so additional runtime optimizations were added. The data was loaded from all 200,000 text files only once and saved to binary NumPy files in the proprietary \verb|.npy| format for subsequent usage. This dramatically improved load times from over 40 minutes to just 2 seconds (1200x faster). Additionally, the binary NumPy files provided greater storage efficiency requiring 67\% less space (1.44 GB instead of 4.3 GB). Text files (\verb|.txt|) and comma-separated values files (\verb|.csv|) are some of the least efficient ways to store and load numerical data in Python. Both were tested during the process of optimization and found to be suboptimal.

\subsection{Model Inference}

Model inference or model prediction was needed to generate the daily samples as described in section \ref{gen-data}. Since there were 200,000 samples overall, the 20\% testing set for each fold of $k$-fold cross validation still included 40,000 samples. To improve model inference runtime, these samples were grouped into very large batches (e.g. 4,096) using the \verb|batch_size| parameter on the \verb|model.predict()| function. Unlike the batch size for training, the batch size for testing has no impact on performance since the model has already been trained. The only limit is the system memory, or more commonly GPU memory when training on a GPU.