Adaptive Fragments-Based Tracking of
Non-Rigid Objects Using Level Sets
Prakash Chockalingam, Nalin Pradeep, and Stan Birchfield
Abstract
We present an approach to visual tracking based on dividing a target into multiple regions, or fragments. The target is represented by a Gaussian mixture model in a joint feature-spatial space, with each ellipsoid corresponding to a different fragment. The fragments are automatically adapted to the image data, being selected by an efficient region-growing procedure and updated according to a weighted average of the past and present image statistics. Modeling of target and background are performed in a Chan-Vese manner, using the framework of level sets to preserve accurate boundaries of the target. The extracted target boundaries are used to learn the dynamic shape of the target over time, enabling tracking to continue under total occlusion. Experimental results on a number of challenging sequences demonstrate the effectiveness of the technique.
Algorithm
The visual tracking approach based on GMM modeling of the object and the level set framework can be summarized as follows:
Initial Frame:
- The user marks the object to be tracked.
- The target object and background scene are segmented based on their appearance similarity.
- The target object and background scene are modeled using a mixture of Gaussians, where each Gaussian corresponds to a fragment that is parameterized by its appearance (RGB) and spatial (xy) mean and covariance.
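The joint feature-spatial modeling above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the simple maximum-likelihood fit are assumptions, and the fragment labels are taken as given (in the paper they come from the region-growing procedure).

```python
import numpy as np

def fit_fragment_gaussians(pixels, labels):
    """Fit one Gaussian per fragment in the joint feature-spatial space.

    pixels: (N, 5) array of [R, G, B, x, y] samples from the region
    labels: (N,) fragment index per sample (e.g. from region growing)
    Returns a list of (weight, mean, covariance) triples, one per fragment.
    """
    components = []
    n = len(labels)
    for k in np.unique(labels):
        pts = pixels[labels == k]
        weight = len(pts) / n                    # mixing weight ~ fragment size
        mean = pts.mean(axis=0)                  # 5-D mean (RGB + xy)
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(5)  # regularized covariance
        components.append((weight, mean, cov))
    return components

def gmm_likelihood(components, z):
    """Evaluate the mixture density at one joint sample z = [R, G, B, x, y]."""
    total = 0.0
    for w, mu, cov in components:
        d = z - mu
        norm = np.sqrt((2 * np.pi) ** 5 * np.linalg.det(cov))
        total += w * np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm
    return total
```

Each fragment thus contributes one ellipsoid in the 5-D space; a pixel far from a fragment in either color or position receives low likelihood from that component.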
Subsequent Frames:
- The spatial components of the Gaussian mixtures are updated using a joint Lucas-Kanade algorithm.
- Each pixel is classified as either foreground or background by generating a strength map using the Gaussian mixture models (GMMs) of the object and background.
- The strength map is integrated into a level set formulation to obtain an accurate contour of the object.
- The parameters of the GMMs are updated using the tracked data.
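The strength-map step above can be sketched as a per-pixel log-likelihood ratio between the two mixtures. This is an illustrative sketch under our own naming, not the paper's code; the (slow) double loop stands in for whatever vectorized evaluation a real tracker would use.

```python
import numpy as np

def gaussian_pdf(z, mean, cov):
    """Density of a multivariate normal at z (dimension inferred from mean)."""
    d = z - mean
    dim = len(mean)
    norm = np.sqrt((2 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

def strength_map(image, fg_gmm, bg_gmm, eps=1e-12):
    """Per-pixel log-likelihood ratio of foreground vs. background.

    image: (H, W, 3) RGB array; each GMM is a list of (weight, mean, cov)
    over joint [R, G, B, x, y] samples.  Positive values favor foreground.
    """
    h, w, _ = image.shape
    s = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            z = np.concatenate([image[y, x], [x, y]]).astype(float)
            pf = sum(wt * gaussian_pdf(z, mu, c) for wt, mu, c in fg_gmm)
            pb = sum(wt * gaussian_pdf(z, mu, c) for wt, mu, c in bg_gmm)
            s[y, x] = np.log(pf + eps) - np.log(pb + eps)
    return s  # threshold at 0 to classify, or feed into the level set evolution
```

Thresholding the map at zero gives the raw classification; in the algorithm the map instead drives the level set evolution so that the extracted contour stays smooth and accurate.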
Results
Below are some experiments showing the tracker's performance on multi-modal objects, objects undergoing drastic shape changes, and large unpredictable frame-to-frame motion. We also handle full occlusion. To accomplish this, the shape of the object contour is learned over time by retaining the output of the tracker in each image frame. To detect occlusion, the rate of decrease in the object size is measured over the previous few frames. Once the object is determined to be occluded, a search is performed in the learned database, using a Hausdorff distance, to find the contour that most closely matches the one just prior to the occlusion. Then, as long as the target is not visible, the sequence of contours following the match is used to hallucinate the contour. Once the target reappears, tracking resumes. This approach prevents tracker failure during complete occlusion and predicts contours when the motion is periodic. Click on any of the images to download its corresponding MPEG file.
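The contour-matching step can be sketched with a symmetric Hausdorff distance over point sets. This is a rough NumPy sketch under our own naming, not the authors' implementation; the brute-force pairwise distance computation is an assumption made for clarity.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two contours, given as (N, 2) point arrays."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # all pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def best_match(database, contour):
    """Index of the stored contour closest to the one seen just before occlusion.

    database: list of (N_i, 2) contour arrays learned during tracking.
    During full occlusion, the contours *following* the matched index are
    replayed frame by frame to hallucinate the hidden target.
    """
    dists = [hausdorff(c, contour) for c in database]
    return int(np.argmin(dists))
```

Replaying the stored contours after the match is what makes the prediction effective for periodic motion, such as a walking gait.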
MPEG video clip | Description
4311 KB | Tickle Me Elmo doll: The benefit of using a multi-modal framework is clearly shown, with accurate contours being computed despite the complexity of both the target and background as Elmo stands tall, falls down, and sits up.
1714 KB | Monkey sequence: A sequence in which a monkey undergoes rapid motion and drastic shape changes. For example, as the monkey swings around the tree, its shape changes substantially in just a few image frames, yet the algorithm is able to remain locked onto the target as well as compute an accurate outline of the animal.
1583 KB | Walk behind tree sequence: A sequence in which a person walks behind a tree in a complex scene with many colors. Our approach predicts both the shape and the location of the object and displays the contour accordingly during the complete occlusion.
2395 KB | Girl sequence: A more complex scenario in which a girl, moving quickly in a circular path (a complete revolution occurs in just 35 frames), is occluded frequently by a boy. Our approach is able to handle this difficult scenario.
1810 KB | Walk behind car sequence: A sequence in which a person is partially occluded by a car. Though we do not explicitly handle partial occlusion, the tracker adjusts the contour to cover only the visible portions of the moving person. Note that although the contours are extracted accurately for the body of the person, there are some errors in extracting the contours of the face region due to the complexity of the skin color.
4402 KB | Fish sequence: The fish are multicolored and swim in front of a complex, textured, multicolored background. Note that the fish are tracked successfully despite their changing shape.
Comparison
To provide quantitative evaluation of our approach, we generated ground truth for the experiments by manually labeling the object pixels in some of the intermediate frames (every 5 frames for the monkey, tree and car sequences, every 10 frames for Elmo, and every 4-6 frames for the girl sequence, avoiding occluded frames in the latter). We computed the error of each algorithm on an image of the sequence as the number of pixels in the image misclassified as foreground or background, normalized by the image size.
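The error measure just described is straightforward; as a small sketch (our own function name, assuming boolean foreground masks for both the tracker output and the ground truth):

```python
import numpy as np

def normalized_error(predicted_mask, truth_mask):
    """Fraction of pixels misclassified as foreground or background.

    Both masks are boolean (H, W) arrays; the count of disagreeing pixels
    is normalized by the image size, as in the evaluation described above.
    """
    assert predicted_mask.shape == truth_mask.shape
    return np.count_nonzero(predicted_mask != truth_mask) / predicted_mask.size
```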
We compared:
- our algorithm;
- an algorithm in which the strength image was computed using the linear RGB histogram representation of Collins et al. [1];
- an algorithm in which the strength image was computed using a standard color histogram, similar to [2,3,4,5].
In the latter two cases the contours were extracted using the level set framework, but the fragment motion was not used. To evaluate the importance of using fragment motion, we also ran our algorithm without this component. Note that both versions of our algorithm were fully automatic, whereas the linear RGB histogram and standard color histogram algorithms were manually restarted after every occlusion, to simulate what they would be capable of achieving even with a perfect module for handling full occlusion.
The table below provides the original BMP image sequences, the ground truth data,
and a comparison of the results of the three approaches. The fourth column shows a plot of the normalized error against the frame numbers.
Original BMP Sequence | Ground Truth Data | Comparison Results | Normalized Error Plot
19504 KB | 48 KB | 3127 KB |
17635 KB | 52 KB | 2019 KB |
23576 KB | 25 KB | 1186 KB |
33464 KB | 22 KB | 806 KB |
42417 KB | 49 KB | 1693 KB |
References
[1] R. Collins, Y. Liu, and M. Leordeanu. On-line selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1631–1643, Oct. 2005.
[2] A. Yilmaz, X. Li, and M. Shah. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1531–1536, Nov. 2004.
[3] T. Zhang and D. Freedman. Tracking objects using density matching and shape priors. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1056–1062, 2003.
[4] S. Jehan-Besson, M. Barlaud, G. Aubert, and O. Faugeras. Shape gradients for histogram segmentation using active contours. In Proceedings of the International Conference on Computer Vision, volume 1, pages 408–415, 2003.
[5] Y. Shi and W. C. Karl. Real-time tracking using level sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 34–41, 2005.