Modern genotyping and sequencing technologies have revolutionized human genetics research, allowing researchers to understand how much we differ from one another. Large datasets describing the common patterns of human genetic variation can be viewed as matrices, with rows representing individuals and columns representing loci in the genome that correspond to common polymorphisms. The broader impact of such datasets cannot be overemphasized: they are a key resource for finding genes that affect health, disease, and responses to drugs and environmental factors, as well as for understanding the evolutionary and biological history of our species. Extracting useful information from such datasets promotes the progress of science and, at the same time, advances national health, prosperity, and welfare. This project will bridge the gap between state-of-the-art data analysis algorithms developed in the theoretical computer science and applied mathematics communities and their application to the increasingly large datasets of the human genetics community.
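As a minimal illustration of the matrix view described above, the following Python sketch stores a toy dataset with individuals as rows and loci as columns, then computes a per-locus allele frequency. The 0/1/2 allele-count encoding, the toy values, and the function name `allele_frequencies` are assumptions made for illustration; they are not taken from the project.

```python
# Toy genotype matrix: rows are individuals, columns are loci.
# Assumed encoding (hypothetical, for illustration only): each entry is the
# number of copies of the alternate allele carried at that locus (0, 1, or 2).
genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [2, 0, 1, 0],
]


def allele_frequencies(matrix):
    """Return the alternate-allele frequency at each locus (column)."""
    n_individuals = len(matrix)
    n_loci = len(matrix[0])
    # Each diploid individual carries two allele copies per locus,
    # so the denominator is 2 * number of individuals.
    return [
        sum(row[j] for row in matrix) / (2 * n_individuals)
        for j in range(n_loci)
    ]


freqs = allele_frequencies(genotypes)
```

Here `freqs` holds one value per column, so a locus at which no individual carries the alternate allele gets frequency 0.0.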

From an algorithmic perspective, the project team will design and analyze novel algorithms for three prototypical, fundamental research topics that combine linear algebra and randomization: sparse Principal Components Analysis, matrix completion, and linear (or kernel) discriminant analysis. All three topics have been widely studied in the theoretical computer science, machine learning, and applied mathematics communities, yet they have been essentially overlooked by the population genetics community. From a population genetics perspective, the team will apply the developed algorithms to gain novel insights into population structure, ancestry informative markers, and natural selection, and to improve imputation methods and the analysis of Genome-Wide Association Studies (GWAS) data. The algorithms for all three topics will be evaluated on population genetics datasets that are available to the PIs. The project will train graduate students and will disseminate the results of the research to a broad community of applied mathematicians, theoretical computer scientists, and population geneticists.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1715202
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2017-09-01
Budget End: 2021-08-31
Support Year:
Fiscal Year: 2017
Total Cost: $499,984
Indirect Cost:
Name: Purdue University
Department:
Type:
DUNS #:
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907