Over the past decade, genome-wide association studies (GWAS) have discovered genetic variants associated with numerous diseases as well as other complex phenotypes. Despite their success, major gaps remain in our understanding of how genetic changes affect phenotype. These gaps, coupled with advances in high-throughput technologies to measure genetic variation, have motivated GWAS of increasingly large scale. However, the statistical and computational challenges posed by the scale and complexity of these studies present a critical bottleneck in realizing their promise. Recent advances in scalable machine learning (ML) have the potential to bring about paradigm-shifting progress in the field of GWAS. However, these concepts have yet to be rigorously explored in the context of the GWAS modeling and testing problems. Exploring the intersection of these domains introduces fundamentally new statistical and computational challenges. The team will develop a suite of modeling and testing methods that target massive modern genomics datasets. The techniques that the team will build upon include low-rank matrix approximation, kernel methods, and matrix completion. The team will also provide open-source software tailored to parallel and distributed computing environments to facilitate widespread adoption of these methods.

Exploring GWAS through the lens of scalable machine learning introduces several research directions and requires the development of novel algorithms and analyses. First, the focus of much scalable ML research has been on the statistical task of prediction, while GWAS inference problems also emphasize hypothesis testing and parameter estimation. Characterizing the behavior of scalable ML methods in these novel settings is a challenging open problem. The team will develop principled GWAS modeling and testing methods. The results are also expected to be of great interest to the scalable ML community. Second, while scalable ML techniques are designed to be general-purpose and domain-agnostic, the GWAS setting introduces rich, biologically motivated domain knowledge that needs to be leveraged to improve the quality of inference. Statistical models that are able to encode this prior knowledge while still permitting efficient inference will be developed. Ultimately, the team will design efficient parallel and distributed algorithms for these core modeling and testing problems and develop robust open-source implementations that leverage modern computing infrastructure. The proposed methods will dramatically improve the scalability of current GWAS analyses, on the one hand, while enabling the development of increasingly realistic genomic models, on the other. Collaborations and open-source artifacts will enable the widespread adoption of these methods by the human genetics community. This project will lead to a closer interaction of the genomics and machine learning communities at UCLA and beyond.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1705121
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2017-07-01
Budget End
2021-06-30
Support Year
Fiscal Year
2017
Total Cost
$992,927
Indirect Cost
Name
University of California Los Angeles
Department
Type
DUNS #
City
Los Angeles
State
CA
Country
United States
Zip Code
90095