Researchers in human genetics now have access to unprecedented amounts of genetic information characterizing how truly different we are from one another. From a Computer Science and Applied Mathematics perspective, the resulting datasets can be thought of as matrices, with the rows representing individuals and the columns representing loci in the genome that correspond to common or rare polymorphisms. Analyzing such datasets, Genome-Wide Association Studies (GWAS) have reported over 10,000 strong associations between genetic variants and complex traits. However, tools that allow efficient analysis of very-large-scale datasets are still missing. Extracting useful information from such datasets promotes the progress of science and, at the same time, advances public health, prosperity, and welfare. This project will bridge the gap between state-of-the-art algorithms developed in the theoretical computer science community and the application of such algorithms to the analysis of the increasingly large volume of data in the human genetics community.

This project will explore how randomized linear algebra, from both a theoretical and a practical standpoint, can be used to speed up human genetics data analytics. The first research direction will investigate Linear Mixed Models (LMMs), which form a linear model of the genetic effects on the phenotype of interest. Randomized linear algebra tools will be used to speed up the solution of the resulting optimization problem without sacrificing accuracy. The second research direction will investigate Polygenic Risk Scores (PRS), which typically operate by first selecting a large number of genetic markers (often in the tens of thousands) out of all available markers (often in the many millions) using single-marker significance tests. This feature selection stage is followed by building regression models on the selected markers to predict phenotypes. Randomized linear algebra tools will be used to speed up PRS approaches while preserving generalization accuracy. Finally, the third research direction will explore how the particular structure of population genetics datasets can be leveraged to design improved randomized linear algebra tools for the analysis of human genetics datasets. The investigators will disseminate their results to a broad community of applied mathematicians, theoretical computer scientists, and population geneticists. They both participate in population genetics conferences and workshops and publish in high-profile journals in population genetics, as well as in conferences and workshops in Computer Science. The investigators will additionally disseminate this knowledge to graduate and undergraduate students. They will involve under-represented groups in their research activities, leveraging their prior track record of involving such groups in cutting-edge research.
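The two-stage PRS procedure described above (single-marker screening followed by a regression model on the selected markers) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the investigators' method: the genotype matrix is simulated, the single-marker test is replaced by an absolute-correlation score, the regression model is ridge regression solved exactly, and the helper names (`marginal_scores`, `select_markers`, `fit_ridge`) are hypothetical. No randomized linear algebra is used in this sketch; it only shows the baseline pipeline that such tools would aim to accelerate.

```python
import numpy as np

def marginal_scores(X, y):
    """Absolute correlation of each marker with the phenotype.

    Assumption: used here as a simple stand-in for a single-marker test.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf  # constant (monomorphic) markers get score 0
    return np.abs(Xc.T @ yc) / denom

def select_markers(X, y, k):
    """Stage 1: keep the k markers with the largest marginal score."""
    scores = marginal_scores(X, y)
    return np.sort(np.argsort(scores)[-k:])

def fit_ridge(X, y, lam):
    """Stage 2: ridge regression with an unpenalized intercept."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(gram, Xc.T @ (y - mu_y))
    return beta, mu_y - mu_x @ beta

# Simulated data (illustrative only): rows are individuals, columns are
# markers coded as minor-allele counts in {0, 1, 2}.
rng = np.random.default_rng(0)
n, p, k = 400, 2000, 50
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
causal = np.arange(5)                      # hypothetical causal markers
y = X[:, causal].sum(axis=1) + rng.normal(size=n)

# Screen and fit on the training individuals only, then score the rest.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

selected = select_markers(X_train, y_train, k)
beta, intercept = fit_ridge(X_train[:, selected], y_train, lam=10.0)

prs = X_test[:, selected] @ beta + intercept   # polygenic score per person
test_corr = np.corrcoef(prs, y_test)[0, 1]
print(f"selected {selected.size} markers, held-out correlation {test_corr:.2f}")
```

Screening and fitting both use only the training individuals, so the held-out correlation is not inflated by selecting markers on the evaluation data. At realistic scale (millions of markers), the screening pass and the regression solve are the steps where faster linear algebra would matter most.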

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 2006929
Program Officer: Amarda Shehu
Project Start:
Project End:
Budget Start: 2020-09-01
Budget End: 2023-08-31
Support Year:
Fiscal Year: 2020
Total Cost: $316,054
Indirect Cost:
Name: Purdue University
Department:
Type:
DUNS #:
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907