The rapid growth in both the volume and complexity of public-domain and scientific data presents new and exciting challenges. This project aims to develop a theoretical framework for structured learning of distribution spaces and to study tools for identifying and exploiting probabilistic structure in high-dimensional, large-volume data. The project lies at the intersection of multiple disciplines (signal processing, pattern recognition, machine learning, and probability and statistics) and will therefore foster collaboration among them. Applying the proposed framework to data-driven medical diagnosis and ecological research will extend the impact of the project beyond the realm of computational data analysis. Additionally, this research aims to enrich the quality of education for both undergraduate and graduate students through the integration of research, application, and new curriculum.
The research framework consists of geometrically constrained probabilistic modeling and efficient optimization approaches for inference on multiple-instance data. The project sets forth the following tasks: (i) confidence-constrained joint estimation of multiple discrete probability models; (ii) joint learning of multiple distribution-based, geometrically constrained maximum-entropy models; and (iii) direct application of the developed framework to the analysis of clinical flow cytometry data for medical diagnosis and in situ bioacoustics data for ecological research.