Genome-wide association studies (GWAS) have improved our understanding of the genetic architectures of many complex diseases and hold the promise of identifying genomic loci of causal variants and enabling accurate genetic risk prediction. However, because most traits of medical interest are influenced by a multitude of genetic factors, each of which explains only a small fraction of heritability, cohort sizes on the scale of hundreds of thousands of individuals will be necessary to provide the statistical power required to detect these elusive associations. This proposal aims to develop fast and powerful statistical methods addressing key challenges that arise in modeling such large-scale data sets: correcting for subtle confounding from population stratification or cryptic relatedness among study participants while maintaining computational tractability. The current state-of-the-art approach to association testing uses linear mixed models to simultaneously model the effects of all markers while accounting for sample structure. Existing mixed model techniques are computationally expensive, however, and also assume that all markers have nonzero effects. This proposal aims to extend mixed model methods by developing and implementing a new well-calibrated mixed model statistic that can be computed very quickly and tailored to more realistic genetic architectures. The first specific aim is to develop a novel method that analyzes linkage disequilibrium patterns to calibrate mixed model association test scores, distinguishing genome-wide inflation of test statistics due to sample structure from apparent inflation that is in fact the genuine result of many causal loci. This method will safeguard against the twin dangers of false-positive associations from confounding and power loss from overly conservative calibration.
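One way to picture the calibration idea in the first aim is a regression of association test statistics on per-marker linkage disequilibrium (LD) scores: polygenic signal should grow with a marker's LD score, whereas confounding should inflate all markers roughly equally and so appear in the intercept. The following is a minimal sketch under that assumption, not the proposal's method. The function name, the simulated LD scores, and the chosen intercept and slope are all hypothetical, and plain least squares stands in for whatever weighting a real analysis would use.

```python
import numpy as np

def ld_regression_intercept(chi2, ld_scores):
    """Least-squares fit of chi2 ~ intercept + slope * ld_score.

    Under the assumption stated above, an intercept near 1 suggests little
    confounding, while the slope reflects polygenic signal.
    """
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    (intercept, slope), *_ = np.linalg.lstsq(design, chi2, rcond=None)
    return intercept, slope

# Simulated data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n_markers = 20000
ld_scores = rng.uniform(1.0, 200.0, size=n_markers)
true_intercept, true_slope = 1.1, 0.01

# Each statistic is a scaled 1-d.o.f. chi-square whose mean follows the line.
expected = true_intercept + true_slope * ld_scores
chi2 = expected * rng.chisquare(1, size=n_markers)

intercept, slope = ld_regression_intercept(chi2, ld_scores)
print(f"estimated intercept: {intercept:.3f}, estimated slope: {slope:.5f}")
```

In this toy setting the fitted intercept recovers the simulated inflation (here 1.1) separately from the LD-dependent slope, which is the distinction the aim describes.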
The second aim is to develop a fast algorithm that applies modern iterative methods for numerical linear algebra to reduce the computational complexity of mixed model association testing to linear in the data size. This advance will enable mixed model analysis to remain feasible as study sizes increase, unlocking associations from rare or small-effect variants.
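To illustrate why iterative methods can bring the cost down to linear in the data size, the sketch below solves a mixed-model linear system with conjugate gradient using only matrix-vector products against the genotype matrix, so the N x N relatedness matrix is never formed. The proposal does not name a specific iterative method; conjugate gradient is assumed here, and the variance components, dimensions, and random stand-in genotypes are hypothetical.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=500):
    """Solve A x = b for symmetric positive-definite A given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()            # residual b - A x, with x = 0 initially
    p = r.copy()            # search direction
    rs_old = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * b_norm:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Hypothetical toy data: random values stand in for normalized genotypes.
rng = np.random.default_rng(1)
n_samples, n_markers = 200, 1000
genotypes = rng.standard_normal((n_samples, n_markers))
phenotype = rng.standard_normal(n_samples)
var_g, var_e = 0.5, 0.5     # assumed variance components

def covariance_matvec(v):
    # (var_g * X X^T / M + var_e * I) v, costing O(M N) per product
    return var_g * (genotypes @ (genotypes.T @ v)) / n_markers + var_e * v

solution = conjugate_gradient(covariance_matvec, phenotype)
```

Each iteration touches the genotype matrix twice, so the per-iteration cost scales with the number of genotype entries rather than with the cube of the sample size required by a direct factorization.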
The third aim is to extend the method to model genetic architectures in which most markers have no disease association, as is widely believed, thereby improving statistical power. All of these techniques will be validated in simulation, implemented in software released to the scientific community, and applied to real GWAS data sets to search for additional associations that reach significance.

Public Health Relevance

Although genome-wide association studies have improved our understanding of the genetic bases of many complex diseases, most traits of interest have hundreds or thousands of causal factors that are extremely difficult to detect. This proposal aims to advance the statistical methodology used to detect associations by improving statistical power and reducing the computational burden of large-scale data analysis. The techniques developed will enable continued discovery of disease-associated genetic variants and more accurate prediction of genetic risk.

Fast and powerful extensions of mixed model methods for GWAS

Although genome-wide association studies (GWAS) have been successful in improving our understanding of the genetic architectures of many complex diseases, most traits of interest are highly polygenic (i.e., influenced by many genetic factors) and thus challenging to decipher: in a typical scenario, the tens or hundreds of associated loci that have been identified to date each explain a small percentage of phenotypic variance, collectively accounting for only a fraction of estimated heritability. In order to detect associations of such small magnitudes, it is critical to maximize the power obtainable from available samples and to account for subtle confounders such as population stratification and cryptic relatedness. The current state-of-the-art approach uses linear mixed models to simultaneously model the effects of all markers while accounting for sample structure via a genetic relatedness matrix. Existing techniques for mixed models are computationally expensive, however, and also implicitly make the unrealistic assumption that all markers have effect sizes drawn from independent, identical normal prior distributions. This proposal aims to extend mixed model methods by developing a new well-calibrated mixed model statistic that can be computed very quickly and tailored to the hypothesized genetic architecture underlying a trait to improve power.
The new method will be implemented in publicly available, open-source software and applied to analyze real data sets of medical interest. To achieve these aims, I will employ my strong mathematical background and previous experience developing computational methods and software for population genetics while receiving new training in medical genetics.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Postdoctoral Individual National Research Service Award (F32)
Project #
5F32HG007805-03
Application #
9186420
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Colley, Heather
Project Start
2014-12-01
Project End
2017-07-31
Budget Start
2016-12-01
Budget End
2017-07-31
Support Year
3
Fiscal Year
2017
Name
Harvard University
Department
Public Health & Prev Medicine
Type
Schools of Public Health
DUNS #
149617367
City
Boston
State
MA
Country
United States
Zip Code
02115
Loh, Po-Ru; Genovese, Giulio; Handsaker, Robert E et al. (2018) Insights into clonal haematopoiesis from 8,342 mosaic chromosomal alterations. Nature 559:350-355
Loh, Po-Ru; Palamara, Pier Francesco; Price, Alkes L (2016) Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet 48:811-6
Loh, Po-Ru; Danecek, Petr; Palamara, Pier Francesco et al. (2016) Reference-based phasing using the Haplotype Reference Consortium panel. Nat Genet 48:1443-1448
Galinsky, Kevin J; Bhatia, Gaurav; Loh, Po-Ru et al. (2016) Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am J Hum Genet 98:456-472
Tucker, George; Loh, Po-Ru; MacLeod, Iona M et al. (2015) Two-Variance-Component Model Improves Genetic Prediction in Family Datasets. Am J Hum Genet 97:677-90
Loh, Po-Ru; Bhatia, Gaurav; Gusev, Alexander et al. (2015) Contrasting genetic architectures of schizophrenia and other complex diseases using fast variance-components analysis. Nat Genet 47:1385-92
Bulik-Sullivan, Brendan K; Loh, Po-Ru; Finucane, Hilary K et al. (2015) LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47:291-5
Loh, Po-Ru; Tucker, George; Bulik-Sullivan, Brendan K et al. (2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat Genet 47:284-90
Hayeck, Tristan J; Zaitlen, Noah A; Loh, Po-Ru et al. (2015) Mixed model with correction for case-control ascertainment increases association power. Am J Hum Genet 96:720-30
Lipson, Mark; Loh, Po-Ru; Sankararaman, Sriram et al. (2015) Calibrating the Human Mutation Rate via Ancestral Recombination Density in Diploid Genomes. PLoS Genet 11:e1005550