This project develops new techniques for visually interpreting an image in a way that specifically leverages large image collections, now common on the web and elsewhere. The research team uses an approach whose performance directly scales with the size of the dataset, unlike many existing approaches to image understanding. The basic approach is to build a copy of a query image by assembling pieces of a large set of training images, in the manner of a jigsaw puzzle. Each region in the query is then classified by copying the labels from the matched training regions. The larger the training set, the more jigsaw pieces there are to choose from, and thus the more accurate the match.
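The jigsaw-style label transfer described above can be sketched as nearest-neighbor matching over region descriptors. The descriptors, labels, and function names below are illustrative assumptions, not the project's actual method; a real system would use learned region features over a large training collection.

```python
def match_and_label(query_regions, train_regions, train_labels):
    """For each query region descriptor, copy the label of the closest
    training region (squared Euclidean distance). A toy sketch of
    jigsaw-style label transfer; names and distances are assumptions."""
    labels = []
    for q in query_regions:
        best_label, best_dist = None, float("inf")
        for t, lab in zip(train_regions, train_labels):
            d = sum((a - b) ** 2 for a, b in zip(q, t))
            if d < best_dist:
                best_label, best_dist = lab, d
        labels.append(best_label)
    return labels

# Toy usage: two query regions matched against a two-region training set.
result = match_and_label(
    query_regions=[[0.1, 0.0], [0.9, 1.0]],
    train_regions=[[0.0, 0.0], [1.0, 1.0]],
    train_labels=["sky", "grass"],
)
# result == ["sky", "grass"]
```

As the training set grows, the pool of candidate "pieces" grows with it, which is why performance scales with dataset size; the brute-force loop here would of course be replaced by an efficient index in practice.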
The initial work of the project focuses on developing efficient methods for performing the matching that allow the incorporation of various desirable constraints. The approach is then extended to handle training data with incomplete labels -- important since few datasets have labels for every region. The research plan also includes building better embeddings for the regions, which place semantically similar regions closer together than current representations do, and developing efficient binary matching schemes over those embeddings.
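One standard way to make such matching efficient, sketched here as an assumption rather than the project's own scheme, is to hash region embeddings to short binary codes (e.g. via random hyperplanes) and compare them by Hamming distance, which is far cheaper than floating-point comparison at scale.

```python
import random

def binary_code(vec, hyperplanes):
    """Hash a real-valued embedding to a binary code: the sign of the
    projection onto each random hyperplane yields one bit. This
    random-hyperplane hashing is one common choice (LSH), shown
    purely for illustration."""
    bits = 0
    for h in hyperplanes:
        dot = sum(a * b for a, b in zip(vec, h))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

# Toy usage: 16-bit codes over 4-dimensional embeddings.
random.seed(0)
planes = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
code = binary_code([0.2, -0.5, 0.1, 0.9], planes)
# hamming(code, code) == 0, and nearby embeddings tend to get nearby codes
```

Matching in Hamming space lets the jigsaw search scan millions of candidate regions with bitwise operations, complementing the planned work on better region embeddings.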
Robust techniques for visual recognition have widespread applicability in areas such as image search, robotics, and surveillance. The project also involves extensive outreach activities, including high-school internships and the organization of a NY-area vision day for students and researchers.