The following instructions assume you have the enhanced gesture GT and the videos_crop.txt file, both available at separate links on the website.

-'gestures' folder: includes 486 text files; each contains the gesture ground truth for one video.
-'videos_crop.txt': includes the cropping window coordinates and the start and end timestamps of each video.

How to read:

1. GT gesture file: the file is named PARTICIPANT_COURSE.txt. Columns from left to right are:
   1) gesture type; available types are bite, drink, non-intake;
   2) the gesture start timestamp in milliseconds, from the beginning of the video;
   3) the gesture end timestamp in milliseconds, from the beginning of the video.

2. 'video_info.txt': Columns from left to right are:
   1) video name, in PARTICIPANT_COURSE format;
   2) x coordinate of the top-left corner of the cropping window;
   3) y coordinate of the top-left corner of the cropping window;
   4) x coordinate of the bottom-right corner of the cropping window;
   5) y coordinate of the bottom-right corner of the cropping window;
   6) the start timestamp of the meal in milliseconds, from the beginning of the video;
   7) the end timestamp of the meal in milliseconds, from the beginning of the video.

*Notes:
   1) all timestamps are counted from the beginning of the video recording.
   2) all coordinates are in the frame coordinate system, where the origin is the top-left corner of the frame.
   3) values in each line are separated by a tab.
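The column layouts above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the names Gesture, VideoInfo, parse_gestures and parse_video_info are hypothetical, and it assumes every numeric field is written as a plain integer and that the files have no header line.

```python
from typing import Iterable, List, NamedTuple


class Gesture(NamedTuple):
    label: str     # bite, drink, or non-intake
    start_ms: int  # milliseconds from the beginning of the video
    end_ms: int    # milliseconds from the beginning of the video


class VideoInfo(NamedTuple):
    name: str           # PARTICIPANT_COURSE
    x1: int             # top-left corner of the cropping window
    y1: int
    x2: int             # bottom-right corner of the cropping window
    y2: int
    meal_start_ms: int  # milliseconds from the beginning of the video
    meal_end_ms: int


def parse_gestures(lines: Iterable[str]) -> List[Gesture]:
    """Parse the lines of one PARTICIPANT_COURSE.txt gesture file."""
    gestures = []
    for line in lines:
        fields = line.split()  # splits on tabs (and any other whitespace)
        if not fields:
            continue  # skip blank lines
        label, start, end = fields
        # Assumption: timestamps are stored as plain integers.
        gestures.append(Gesture(label, int(start), int(end)))
    return gestures


def parse_video_info(lines: Iterable[str]) -> List[VideoInfo]:
    """Parse the lines of the cropping/timestamp file (7 columns per line)."""
    videos = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, *numbers = fields
        # Assumption: coordinates and timestamps are plain integers.
        videos.append(VideoInfo(name, *(int(v) for v in numbers)))
    return videos
```

Both functions take any iterable of lines, so an open file object can be passed directly, for example parse_gestures(open("gestures/p005_c2.txt")).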
To convert a video timestamp to a frame index counted from the beginning of the video:

   frame_idx = timestamp / 1000 * sampling_frame_per_second

To convert a timestamp in the video gesture GT files to the datapoint index in the original gesture ground truth:

   ori_datapoint_idx = (video_timestamp - video_sync_offset) / 1000 * 15

where:
   15 is the original sampling frequency (Hz) of the wrist motion and scale data;
   video_sync_offset is the offset, in milliseconds, between the start time of the video and that of the wrist motion sensor. The offset value is the first value in the *_sync.txt file under the same directory as the video file.

Example: the first line in gestures/p005_c2.txt (participant index is p005, course index is c2):

   bite 41207 43941

This means the first gesture in the video located in Cafeteria/p005/c2/ is a bite. The start and end times of the gesture are 41207 ms and 43941 ms, measured from the start of the video.

To sample the video at 8 Hz:
   the start frame index: 41207 / 1000 * 8 = 329.656, rounded to 330
   the end frame index: 43941 / 1000 * 8 = 351.528, rounded to 352

To convert the timestamps to datapoint indexes in the original gesture ground truth:
   the start datapoint index: (41207 - 1341) / 1000 * 15 = 597.99, rounded to 598
   the end datapoint index: (43941 - 1341) / 1000 * 15 = 639

(The offset value of 1341 milliseconds is read from 20120201115556861_sync.txt under Cafeteria/p005/c2/.)

These datapoint indexes are identical to those in Cafeteria/p005/c2/gesture_union.txt.
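The two conversions above can be sketched in Python as follows. The function names are hypothetical, and rounding to the nearest integer is an assumption inferred from the worked example (the text does not state the rounding rule explicitly).

```python
def timestamp_to_frame_index(timestamp_ms: int, fps: float) -> int:
    """Video timestamp (ms from the video start) -> frame index at `fps`."""
    # Assumption: round to the nearest integer, as the worked example suggests.
    return round(timestamp_ms / 1000 * fps)


def timestamp_to_datapoint_index(timestamp_ms: int,
                                 sync_offset_ms: int,
                                 sensor_hz: float = 15) -> int:
    """Video timestamp -> datapoint index in the original sensor recording.

    sync_offset_ms is the first value in the *_sync.txt file;
    15 Hz is the sampling frequency of the wrist motion and scale data.
    """
    return round((timestamp_ms - sync_offset_ms) / 1000 * sensor_hz)


# First gesture in gestures/p005_c2.txt: bite 41207 43941
print(timestamp_to_frame_index(41207, 8))          # 330
print(timestamp_to_frame_index(43941, 8))          # 352
print(timestamp_to_datapoint_index(41207, 1341))   # 598
print(timestamp_to_datapoint_index(43941, 1341))   # 639
```

The printed values match the example for p005_c2 with a sync offset of 1341 ms.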