Technological advances have led to a rapid proliferation of high-throughput "omics" data in medicine that hold the key to clinically effective personalized medicine. To realize this goal, statistical and computational tools to mine these data and discover biomarkers, drug targets, disrupted disease networks, and disease sub-types are urgently needed. Two primary factors, however, make the development of such statistical tools challenging. First, many high-throughput genomic technologies produce heterogeneous data, including continuous data (microarrays, methylation arrays), count data (RNA-sequencing), and binary/categorical data (SNPs, CNV). These varied data sets do not always satisfy the distributional assumptions imposed by standard high-dimensional statistical models. Second, for scientists to leverage all of their data and understand the complete molecular basis of disease, these varied omics data sets need to be combined into a single multivariate statistical model. This proposal seeks to address these two issues with a new statistical framework for integrated analysis of multiple sets of high-dimensional data measured on the same group of subjects. The key statistical approach uses the theory of exponential family distributions to generalize two foundational high-dimensional statistical frameworks, principal components analysis (PCA) and graphical models, so as to jointly analyze transcriptional, epigenomic, and functional genomics data.

This research will be applied to high-throughput cancer genomics data and lead to new methods to (a) discover molecular cancer sub-types along with their genomic signatures and (b) build a holistic network model of disease. By leveraging information across all available types of genomic biomarkers, the proposed methods have the potential to yield scientific discoveries critical for personalized medicine. The proposed work will also be broadly applicable to integrating multiple sets of "omics" data, including genomics, proteomics, metabolomics, and imaging. Beyond medicine, the theoretical framework and statistical methods will advance the theory of exponential families, statistical learning, and the emerging field of integrative analysis, and will have broad applicability in other disciplines such as engineering and security. All results will be disseminated through publications, conferences, and open-source software; this research will also provide training and educational opportunities for doctoral and postdoctoral scholars.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer: Nandini Kannan
Baylor College of Medicine
United States