Ubiquitous cameras, together with ever-increasing computing resources, are dramatically changing the nature of visual data and its analysis. Cities are adopting networked camera systems for policing and intelligent resource allocation, and individuals are recording their lives with wearable devices. For these camera systems to become truly smart and useful for people, it is crucial that they understand interesting objects in the scene and detect ongoing activities and events, while jointly considering continuous 24/7 video from multiple sources. Such object-level and activity-level awareness in hospitals, elderly homes, and public places would provide assistive and quality-of-life technology for disabled and elderly people, enable intelligent surveillance systems that help prevent crime, and allow smart usage of environmental resources. This project will investigate novel computer vision algorithms that combine 1st-person videos (from wearable cameras) and 3rd-person videos (from static environmental cameras) for joint recognition of humans, objects, and their interactions. The key idea is to exploit the complementary and unique advantages of the two views for joint visual scene understanding. To this end, the project will create a new dataset and develop new algorithms that learn to recognize objects jointly across the views, learn human-object and human-human relationships through the two views, and anonymize the videos to preserve users' privacy. The resulting algorithms have the potential to benefit applications in smart environments, security, and quality-of-life assistive technologies. The project will also perform complementary educational and outreach activities that engage students in research and STEM.
This project will develop novel algorithms that jointly learn from 1st-person videos (from wearable cameras) and 3rd-person videos (from static environmental cameras) to recognize humans, objects, and their interactions. The 1st-person view is ideal for object recognition, while the 3rd-person view is ideal for human activity recognition. By combining the two, this project will investigate unique solutions to challenging problems that would otherwise be difficult to overcome when analyzing each viewpoint in isolation. The main research directions will be: (1) creating a benchmark dataset of paired 1st-person and 3rd-person videos to investigate this new problem; and developing algorithms that (2) learn to establish object and human correspondences between the two views; (3) learn object-action relationships across the views; and (4) anonymize the visual data for privacy-preserving visual recognition.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.