The design of systems that can detect and recognize objects in large image and video repositories will enable significant developments in areas as diverse as the life sciences, surveillance and law enforcement, entertainment, advertising, and copyright protection. While significant progress has been achieved in this area over the last few decades, the design of such systems still requires vast amounts of expert knowledge and manual labor. This project lays the foundation for a long-term vision of recognition systems containing banks of recognition modules that are fully trainable by non-expert users, with minimal requirements in terms of manual data pre-processing and computational complexity. From a technical standpoint, the project addresses two fundamental barriers on the path to this objective: 1) the dependence of current classifiers on carefully assembled and pre-processed training sets, and 2) the training complexity of state-of-the-art classification architectures. The first is addressed through the introduction of a new statistical learning framework, termed weakly supervised assembly of training sets, which combines elements of discriminant visual saliency and image matching to automate the assembly of the training sets required for detection and recognition. The second is addressed through the introduction of new, computationally efficient boosting methods for the design of cascades of large-margin classifiers, with support for both template-based and constellation-based object representations. At the educational level, the project will improve the coverage of the recognition problem in the curriculum, through the introduction of new courses and the development of a software library to be used as a teaching aid in visual information retrieval courses. The design of this library will also provide research opportunities for students from underrepresented backgrounds, at a scale well beyond what is usually available at the undergraduate level.