This CAREER project aims to advance the state of the art in theory and methods for extracting clusters and graphs from big and dirty datasets arising in modern application domains. Clusters and graphs provide a meaningful representation of the structure of information contained in data. For example, in the neuroscience and health care domains, clustering patients with similar phenotypes and genotypes helps identify target groups for drug design, clustering fiber tracks generated by high-resolution Diffusion Spectrum Imaging (DSI) scans of brains helps identify significant neural pathways, and graph structures can reflect connectivity between brain regions. The results of this work will significantly enhance the ability to exploit such modern datasets through new methods for learning clusters and graphs from data that is large-scale, high-dimensional, under-sampled, corrupted, and often available only in a compressed or streaming representation.

Specifically, this project will develop computationally efficient and principled methods for learning clusters and graphs that can (i) perform unsupervised feature selection to discard irrelevant features in high dimensions, (ii) leverage feedback based on intelligent adaptive queries that focus resources on the most informative variables and features, (iii) use compressive measurement design that adapts to the information structure for measurement and computation efficiency, and (iv) handle noisy streaming data. The algorithms will be accompanied by performance guarantees in the form of a precise characterization of the mis-clustering rate and graph recovery error. Additionally, the project will investigate the tradeoffs between the number of measurements, computational complexity, and robustness in these problems. The methods and theory developed will be evaluated through simulations and on real datasets from the neuroscience and healthcare domains, in collaboration with practitioners from these fields.

The results of this research could potentially transform many application domains that involve grouping similar variables and learning complex interactions between them, based on big and dirty datasets. In particular, the neuroscience and healthcare applications are likely to have very direct and significant implications for society. Accurately mapping neural pathways will help diagnose and treat brain pathologies at an early stage, and help understand brain functioning. Clustering patients and discovering disease spreading pathways based on a few measurements of relevant genetic features or indicators could help prevent and cure diseases, and also minimize healthcare costs. The research activities will be tightly integrated with education efforts that aim to develop a diverse workforce that is better equipped with cross-disciplinary tools to address the challenges of modern datasets. The education plan includes the development of two interdisciplinary courses and enhancement of the joint Statistics & Machine Learning PhD program at Carnegie Mellon University (CMU). Outreach activities include promoting undergraduate research and broadening participation of women and underrepresented groups in STEM fields through OurCS (Opportunities for Undergraduate Research in Computer Science), Andrew's Leap (a summer enrichment program for area high school and middle school students), and the CS4HS program aimed at high school and K-8 teachers at Carnegie Mellon University. The results of this project (including publications, data sets, and software) will be disseminated online at www.cs.cmu.edu/~aarti/research_projects/.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1252412
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2013-03-01
Budget End: 2018-02-28
Support Year:
Fiscal Year: 2012
Total Cost: $500,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213