In this proposal, the PI will analyze current genomic data using modern machine learning methods to help make personalized medicine a reality for each patient. Imagine using genomic data and clinical traits to accurately predict the risk of preterm birth early in gestation, or a college student's risk of future heart attacks, and tailoring preventative measures to the specific risk mechanisms. This research addresses a core theme of machine learning methods applied to scientific data: how to robustly and efficiently build predictive models that exploit complex hidden structure in high-dimensional data with limited numbers of samples, as is common in genomic and biomedical data. This proposal addresses a number of fundamental questions: How can correlation among features be used to share strength across limited numbers of samples? How can one test whether an observation in a cell has a causal effect on disease? How can biological structure be encoded in nonlinear functions? These fundamental questions in applied machine learning and statistical genetics will be addressed through the creation of hierarchical models and methods for computationally tractable analyses. These projects will enable recovery of genomic signals with the predictive ability essential for personalized medicine. The PI also plans active engagement with underrepresented minorities in computer science and will make the resulting software publicly available.

This research aims to develop computationally tractable, structured hierarchical models that find complex signals in genomic data hidden to current methods, to use these signals to build predictive models from existing genomic study data, and to use the resulting predictive variants to precisely quantify disease risk for each patient. Success in these goals will advance personalized medicine, enabling a complete understanding of the genetic regulators of disease and making individual-specific disease risk prediction and treatment a reality. Although linear models have been used to analyze scientific data for 125 years, these methods assume unlimited availability of samples and simple linear structure, and fail to recover variants with more complex associations. In genomic data, predictive signal is often compositional, including linear, sparse, low-rank, or nonlinear structure. This proposed research will drastically shift current scientific data analysis by developing efficient methods that recover predictive genetic variants with complex effects. This research is organized around three integrated projects. 1) High-dimensional correlations. Current methods for correlation do not exploit multiple, correlated traits to improve power to find relationships between two high-dimensional sets of observations. The PI will develop computationally tractable models and robust inference methods for structured latent variable models in the presence of substantial observation noise. 2) Sparse, nonlinear regression for prediction by exploiting nonadditive effects. Standard predictive models for genomics assume that associations are sparse and additive across predictors; nonlinear terms are not regularized appropriately. The PI will develop a predictive model that robustly recovers variants with additive and nonadditive effects. 3) Causal inference to study the mechanism of genetic regulation of disease. Current models of causal inference in genomics make unrealistic assumptions and fail to exploit modern machine learning approaches to nonlinearity, regularization, and approximate inference. The PI will develop a hierarchical model for causal analysis to pinpoint the cellular mechanisms of disease.

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer: Sylvia Spengler
Princeton University
United States