Genome-wide association studies (GWAS) have improved our understanding of the genetic architectures of many complex diseases and hold the promise of identifying genomic loci of causal variants and enabling accurate genetic risk prediction. However, because most traits of medical interest are influenced by a multitude of genetic factors, each of which explains only a small fraction of heritability, cohort sizes on the scale of hundreds of thousands of individuals will be necessary to provide the statistical power required to detect these elusive associations. This proposal aims to develop fast and powerful statistical methods addressing key challenges that arise in modeling such large-scale data sets: correcting for subtle confounding from population stratification or cryptic relatedness among study participants while maintaining computational tractability. The current state-of-the-art approach to association testing uses linear mixed models to simultaneously model the effects of all markers while accounting for sample structure. Existing mixed model techniques are computationally expensive, however, and also assume that all markers have nonzero effects. This proposal aims to extend mixed model methods by developing and implementing a new well-calibrated mixed model statistic that can be computed very quickly and tailored to more realistic genetic architectures. The first specific aim is to develop a novel method that analyzes linkage disequilibrium patterns to calibrate mixed model association test scores, distinguishing genome-wide inflation of test statistics due to sample structure from apparent inflation that actually reflects the presence of many causal loci. This method will safeguard against two opposing dangers: false positive associations from confounding and power loss from overly conservative calibration.
The second aim is to develop a fast algorithm that applies modern iterative methods for numerical linear algebra to reduce the computational complexity of mixed model association testing to linear in the data size. This advance will enable mixed model analysis to remain feasible as study sizes increase, unlocking associations from rare or small-effect variants.
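As a rough illustration of what an iterative approach can look like, the sketch below uses conjugate gradient, one standard iterative method, to solve a mixed model system V x = y with V = sigma_g2 * X X^T / M + sigma_e2 * I, without ever forming the N-by-N genetic relatedness matrix. It is not the algorithm proposed here: the function names, the simulated genotype matrix, and the assumption that the variance components are already known are all hypothetical choices made for illustration. Each iteration costs two matrix-vector products with X, so the per-iteration cost is proportional to the size of the genotype data.

```python
import numpy as np


def make_covariance_matvec(X, sigma_g2, sigma_e2):
    """Return v -> V v for V = sigma_g2 * X X^T / M + sigma_e2 * I.

    X is an N x M matrix of (assumed already normalized) genotypes.
    The N x N matrix X X^T is never formed explicitly.
    """
    m = X.shape[1]

    def matvec(v):
        return sigma_g2 * (X @ (X.T @ v)) / m + sigma_e2 * v

    return matvec


def conjugate_gradient(matvec, b, tol=1e-8, max_iter=200):
    """Solve V x = b for symmetric positive definite V given only v -> V v."""
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:
        return x
    r = b.copy()          # residual b - V x, with x = 0
    p = r.copy()          # current search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Vp = matvec(p)
        alpha = rs_old / (p @ Vp)
        x += alpha * p
        r -= alpha * Vp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


# Hypothetical toy data: N = 200 individuals, M = 500 markers.
rng = np.random.default_rng(0)
n, m = 200, 500
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)

# Variance components are assumed known here purely for illustration.
matvec = make_covariance_matvec(X, sigma_g2=0.5, sigma_e2=0.5)
x = conjugate_gradient(matvec, y)
```

The design point is that the solver only needs the action of V on a vector, so memory and time scale with the genotype matrix rather than with the square of the sample size.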
The third aim is to extend the method to model genetic architectures in which most markers have no disease association - as is widely believed - thereby improving statistical power. All of these techniques will be validated in simulation, implemented in software released to the scientific community, and applied to real GWAS data sets to search for additional associations that reach significance.
Although genome-wide association studies have improved our understanding of the genetic bases of many complex diseases, most traits of interest have hundreds or thousands of causal factors that are extremely difficult to detect. This proposal aims to advance the statistical methodology used to detect associations by improving statistical power and reducing the computational burden of large-scale data analysis. The techniques developed will enable continued discovery of disease-associated genetic variants and more accurate prediction of genetic risk.
Fast and powerful extensions of mixed model methods for GWAS
Although genome-wide association studies (GWAS) have been successful in improving our understanding of the genetic architectures of many complex diseases, most traits of interest are highly polygenic - i.e., influenced by many genetic factors - and thus challenging to decipher: in a typical scenario, the tens or hundreds of associated loci that have been identified to date each explain a small percentage of phenotypic variance, collectively accounting for only a fraction of estimated heritability. In order to detect associations of such small magnitudes, it is critical to maximize the power obtainable from available samples and to account for subtle confounders such as population stratification and cryptic relatedness. The current state-of-the-art approach uses linear mixed models to simultaneously model the effects of all markers while accounting for sample structure via a genetic relatedness matrix. Existing techniques for mixed models are computationally expensive, however, and also implicitly make the unrealistic assumption that all markers have effect sizes drawn independently from identical normal prior distributions. This proposal aims to extend mixed model methods by developing a new well-calibrated mixed model statistic that can be computed very quickly and tailored to the hypothesized genetic architecture underlying a trait to improve power.
The new method will be implemented in publicly available, open-source software and applied to analyze real data sets of medical interest. To achieve these aims, I will employ my strong mathematical background and previous experience developing computational methods and software for population genetics while receiving new training in medical genetics.