Multi-label classification refers to automated classification in which multiple target labels are assigned to each instance. For example, articles often cover several topics, and images often contain multiple types of objects. Multi-label classification is a central problem in big data analysis. Complex data, such as documents, images and videos, require automated content annotation to support ranking, retrieval and monitoring operations. Unfortunately, automatic annotation is difficult because multiple labels, exhibiting complex inter-relationships, must be assigned from a large open vocabulary. This research project addresses the three main challenges faced by automated annotation systems that learn multi-label classifiers from data: (1) capturing and exploiting label dependence to overcome data sparsity, (2) exploiting partially labeled data to expand the range of usable resources, and (3) reducing prediction model size to allow practical usability. These three challenges will be tackled from a unified perspective of output representation learning, which has the potential to deliver automated methods for semantic annotation that demonstrate greater autonomy, robustness and accuracy. This research will be integrated into graduate and undergraduate courses, allowing students to develop analytical and computational skills for big data analysis that are currently in high demand. Because the project is centered at a university with a strong program for minority students, it will also encourage participation from underrepresented groups. Undergraduate and high school students will be further engaged through student project competitions.
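To make the baseline concrete, the following is a minimal sketch of the standard approach of training one independent binary classifier per label; it assumes a scikit-learn environment, and the documents and topic labels shown are purely illustrative. Ignoring label dependence in this way is exactly the limitation the proposed research targets.

    # Minimal sketch: one independent binary classifier per label
    # (documents and topics below are hypothetical examples).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = [
        "central bank raises interest rates to curb inflation",
        "new vaccine trial reports promising results",
        "government funds hospital construction and public health programs",
    ]
    topics = [["economics"], ["health", "science"], ["health", "politics"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(topics)              # one indicator column per label
    X = TfidfVectorizer().fit_transform(docs)  # bag-of-words features

    # Fit a separate logistic-regression classifier for each label column;
    # this treats labels as independent and cannot exploit their dependence.
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
    predicted_topics = mlb.inverse_transform(clf.predict(X))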
The core technical challenges addressed by this research project arise from the phenomena of complex label spaces and sparse data: annotations in big data exhibit ontological structure and missing labels, while even in massive data collections, such as Flickr, most labels do not have enough positive examples to allow an accurate classifier to be trained independently for each label. To address these challenges, this research project will pursue three main research aims. First, methods for learning multi-label output kernels will be developed that allow auxiliary label information to be combined with state-of-the-art multi-label training losses. These methods will provide important new approaches for addressing the label dependence challenge. Second, methods for learning distributed label representations will be developed that also incorporate auxiliary label information with effective training losses. These methods will provide new approaches for addressing the label dependence and label dimension challenges while making an alternative computational trade-off to output kernel learning. Third, new methods for learning output representations from incomplete labels will be developed that combine missing label imputation with predictor training under effective multi-label losses. This work will greatly extend the practical applicability of multi-label classification learning methods to the partially labeled data that is typically encountered in big data settings. By using output representation learning to improve the quality of multi-label classifiers, this research will advance ranking and retrieval capabilities in important applications, including document and health record management, image and video management, and semantic web analysis. Moreover, by offering flexible engagement opportunities via diverse application studies, algorithm development, experimentation and analysis, this project is well suited to engaging students in research.
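As an illustration of the second aim, the following is a minimal sketch of a distributed label representation obtained from a low-rank embedding of the label matrix (in the spirit of principal-label-space transformation). It is an assumed baseline construction using NumPy on synthetic data, not the output-representation-learning methods proposed here, but it shows how a small number of label codes can replace one predictor per label.

    # Minimal sketch: low-rank distributed label representation (assumed
    # baseline; synthetic data, not the proposed method).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, L, k = 200, 50, 30, 5        # examples, features, labels, code size
    X = rng.normal(size=(n, d))                          # input features
    Y = (rng.random(size=(n, L)) < 0.1).astype(float)    # sparse binary labels

    # Embed labels: the top-k right singular vectors of Y define an L x k
    # projection, giving each example a k-dimensional distributed label code.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:k].T                       # L x k label-embedding matrix
    Z = Y @ V                          # n x k label codes

    # Train k ridge regressors from inputs to label codes (closed form),
    # instead of L independent per-label classifiers.
    lam = 1.0
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)

    # Decode: map predicted codes back to the label space and threshold.
    Y_hat = (X @ W) @ V.T
    predictions = (Y_hat > 0.5).astype(int)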