This project, a collaboration between statisticians and a statistical geneticist, develops statistical theory and methods for the analysis of data from genome-wide association studies (GWAS). Over the past decade, GWAS have been very successful in detecting genetic variants that affect complex human traits and diseases, yet these discoveries account for only a small portion of the underlying genetic factors. Recently, significant progress has been made through analyses based on a class of statistical models called mixed effects models. However, it is not well understood why the method works, because the statistical model used in the analysis is, in a sense, misspecified. This project aims to fill that gap by developing new theory and methods and by evaluating the methods through applications to real data. The project will promote teaching, training, and learning; broaden the participation of students from under-represented groups; and build research networks between institutions. The research will be of great interest to many other areas of science, and the results will be widely disseminated in subject-matter journals.

In the past decade, more than 24,000 single-nucleotide polymorphisms (SNPs) have been reported to be associated with at least one trait or disease at the genome-wide significance level. However, these significantly associated SNPs account for only a small portion of the genetic factors underlying complex human traits and diseases, a shortfall known in the genetics community as "missing heritability." Recently, significant progress has been made using the restricted maximum likelihood (REML) approach based on linear mixed models (LMMs). While the REML approach appears to provide the right answer to many problems of practical interest, researchers have been puzzled by the fact that the LMM under which the REML estimators are derived is misspecified. In a recently published article, the investigators proved that the REML estimators of some important genetic quantities, such as heritability and the variance of the environmental error, are consistent despite the model misspecification. Although this pioneering work led to a new field called misspecified mixed model analysis (MMMA), many theoretical and practical challenges remain unsolved. This project seeks to address the following problems: (1) extension of MMMA to correlated SNPs, (2) derivation of the asymptotic distribution of the REML estimator under a misspecified LMM, (3) resampling methods for MMMA, (4) estimation of the number of nonzero random effects, and (5) extensions to multiple random effect factors and discrete traits. The research will also include software development to implement the methods.
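
The consistency result described above can be illustrated with a small simulation. The sketch below is not the investigators' software. It assumes one particular form of misspecification (only a subset of SNPs is truly causal, while the fitted LMM treats every SNP as having a normally distributed random effect), independent SNPs, and an intercept-only fixed effect. The function names, sample sizes, and the grid search over heritability are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_gwas(n=800, p=800, m=80, h2=0.5, seed=0):
    """Simulate a trait in which only m of p independent SNPs are causal.

    All settings here are illustrative assumptions, not values from the project.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.05, 0.5, size=p)            # minor allele frequencies
    G = rng.binomial(2, maf, size=(n, p)).astype(float)
    Z = (G - G.mean(axis=0)) / G.std(axis=0)        # standardize each SNP column
    alpha = np.zeros(p)
    causal = rng.choice(p, size=m, replace=False)
    # Equal-magnitude effects with random signs, so the genetic variance is about h2.
    alpha[causal] = rng.choice([-1.0, 1.0], size=m) * np.sqrt(h2 / m)
    y = 1.0 + Z @ alpha + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return y, Z


def reml_heritability(y, Z):
    """REML estimate of heritability under the (misspecified) all-SNPs LMM.

    Working model: y = intercept + g + e, with g ~ N(0, s2 * h2 * K),
    e ~ N(0, s2 * (1 - h2) * I), and K = Z Z' / p.
    """
    n, p = Z.shape
    K = Z @ Z.T / p
    X = np.ones((n, 1))
    # Columns of A span the orthogonal complement of X, so A'y are error
    # contrasts and their likelihood is the restricted likelihood.
    Q, _ = np.linalg.qr(X, mode="complete")
    A = Q[:, 1:]
    s, U = np.linalg.eigh(A.T @ K @ A)
    yt = U.T @ (A.T @ y)                            # rotated contrasts, independent under the model
    grid = np.linspace(0.001, 0.999, 999)           # candidate heritability values
    d = grid[:, None] * s[None, :] + (1.0 - grid[:, None])
    sigma2 = np.mean(yt ** 2 / d, axis=1)           # total variance profiled out
    loglik = -0.5 * (len(yt) * np.log(sigma2) + np.log(d).sum(axis=1))
    return grid[np.argmax(loglik)]


if __name__ == "__main__":
    y, Z = simulate_gwas()
    print("true heritability: 0.5")
    print("REML estimate under the misspecified LMM:", reml_heritability(y, Z))
```

Under these assumptions the estimate should land near the simulated heritability even though only one SNP in ten is causal. A single run of this size is noisy, so averaging over several seeds gives a clearer picture.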

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1713120
Program Officer: Gabor Szekely
Budget Start: 2017-07-01
Budget End: 2021-06-30
Fiscal Year: 2017
Total Cost: $279,935
Name: University of California Davis
City: Davis
State: CA
Country: United States
Zip Code: 95618