Video understanding and analysis have a wide range of applications. As a cornerstone of video understanding, visual tracking provides online motion information about objects of interest, such as walking pedestrians in autonomous driving, moving cells in bioengineering studies, and deforming guidewires in medical interventions, to name a few. Despite recent advances in deep learning-based visual tracking systems, a significant gap remains between state-of-the-art algorithms and real-world applications. One conjecture is that the advantages of deep learning have not been fully explored, largely because of the lack of large-scale, high-quality tracking datasets. Currently, the largest published fully annotated tracking dataset contains fewer than 2,000 videos, which is hardly sufficient for effectively learning a robust tracking model. This project confronts the issue by working directly and explicitly on large-scale learning of tracking algorithms, and aims to improve tracking systems in several respects, including accuracy, efficiency, robustness, and generalization capability. The resulting datasets, benchmark, diagnosis toolkit, tracking algorithms, and temporal modeling techniques will be made publicly available and are expected to make significant contributions to computer vision and related fields.

The overall goal of this research is to push the frontier of visual object tracking through large-scale learning. The project divides the research activities into three thrusts. First, large-scale, high-quality tracking datasets will be constructed with full annotation. Based on these datasets, an online benchmark platform will be derived and a tracking diagnosis toolkit will be developed for studying challenge factors in visual tracking. These results will provide the data basis, test beds, and analytic tools to facilitate research in visual tracking. Second, efforts will be devoted to improving the robustness of deep trackers against various challenge factors, by either optimizing tracker architectures or integrating predictions of these factors. Third, effective deep temporal models will be developed in two ways: one implicitly encodes temporal information in joint spatial-temporal CNN structures, while the other develops attention-guided dual memory models.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

State University New York Stony Brook
Stony Brook
United States