This proposal aims to develop statistical models and computational methods to quantify the degree of local haplotype sharing between two individuals at an arbitrary marker, and to provide an in-depth understanding of how haplotypes affect disease phenotypes directly, how they serve as genetic background for polymorphic sites to affect phenotypes differentially, and how haplotype background serves as a medium for rare variants to aggregate and affect phenotypes. An investigation into these problems will provide insights into the etiology of complex traits and computational tools for disease association mapping, a strategic goal in which NIH has invested heavily, and will shed light on the lingering puzzle of the missing heritability. Our haplotype method reinvents haplotype association mapping to provide several benefits: no phasing requirement, no sliding-window requirement, an ability to work directly with next-generation sequencing data, and enhanced interpretability of association findings. Because each SNP serves as a core SNP for its local haplotypes, our haplotype method has the same number of tests as the single-SNP analysis. Detecting genetic associations while accounting for haplotype backgrounds at each marker will shift the paradigm for large-scale genetic association studies. The single-marker test assumes that an allele has the same effect, independent of its haplotype background. Our fundamental assumption is that, depending on its local haplotype background, an allele can have a positive effect, zero effect, or a negative effect on a phenotype (for example, due to local epistatic interactions). When all individuals share the same local haplotype background, our assumption reduces to the conventional assumption of homogeneous effect; when individuals have different local haplotype backgrounds, our assumption yields more power.
For example, when an allele has a large effect on a particular haplotype background and zero effect otherwise, traditional analysis, which ignores the haplotype background, can fail to detect the association because the signal is diluted by individuals with other haplotype backgrounds. If correctly quantified, however, haplotype background can control and reduce the noise introduced by those individuals. Aggregating rare variants within an LD block makes the aggregation approach applicable to whole genome sequencing data. Current methods aggregate rare variants based on gene annotation and are difficult to extend to whole genome sequencing data. Our method can quantify LD blocks, allowing for aggregation of rare variants within an LD block. This not only avoids arbitrariness in aggregating variants, but also helps in interpreting associations. Moreover, current methods aggregate rare variants while ignoring the variants' haplotype background, which inevitably loses power. An extreme example is the analysis of sequencing data from admixed samples, where ignoring the haplotype background is equivalent to not controlling for local ancestry. Thus, we propose methods to aggregate rare variants according to their haplotype background.
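
The dilution argument above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the proposal: a haploid 0/1 allele, a single known binary background indicator, and made-up values for the sample size, background frequency, and effect size. It compares an ordinary single-marker regression with the same regression restricted to carriers of the background; it is not the proposed haplotype-sharing method, which quantifies the background rather than assuming it is observed.

```python
import numpy as np

# All numeric settings below are illustrative assumptions, not values from the proposal.
rng = np.random.default_rng(0)
n = 20_000                                  # hypothetical sample size
on_background = rng.random(n) < 0.2         # 20% carry the hypothetical background
allele = rng.integers(0, 2, size=n)         # 0/1 allele (haploid simplification)

# The allele shifts the phenotype by 0.5 only on the background, and by 0 otherwise.
phenotype = 0.5 * allele * on_background + rng.normal(size=n)


def slope_and_z(x, y):
    """Least-squares slope of y on x and its z-score (slope / standard error)."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    sxx = x_c @ x_c
    beta = (x_c @ y_c) / sxx
    resid = y_c - beta * x_c
    se = np.sqrt((resid @ resid) / (len(x) - 2) / sxx)
    return beta, beta / se


# Single-marker analysis: every individual is pooled, so the effect is diluted.
marginal_beta, marginal_z = slope_and_z(allele, phenotype)

# Background-aware analysis: restrict to individuals carrying the background.
bg_beta, bg_z = slope_and_z(allele[on_background], phenotype[on_background])

print(f"pooled:        slope={marginal_beta:.3f}, z={marginal_z:.2f}")
print(f"on background: slope={bg_beta:.3f}, z={bg_z:.2f}")
```

Under these assumptions the pooled slope estimates roughly the effect size times the background frequency, while the restricted analysis recovers the full effect despite using a fifth of the samples.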

Public Health Relevance

We propose novel statistical methods and computational tools to analyze existing SNP array data sets and upcoming exome and whole genome sequencing data sets, increasing the yield of association findings and adding value to current and future investments. Incorporating local haplotype sharing into association detection has the potential to reveal novel associations, regions that harbor allelic heterogeneity, and associations that have large conditional effect sizes. Thus, our methods are valuable for understanding disease etiology and pinpointing causal variants. Together, these will have a profound impact on our ability to produce better treatments, better prevention, and improved healthcare.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
7R01HG008157-05
Application #
9793542
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
Project Start
2018-08-01
Project End
2019-02-28
Budget Start
2018-08-01
Budget End
2019-02-28
Support Year
5
Fiscal Year
2018
Total Cost
Indirect Cost
Name
Duke University
Department
Type
DUNS #
044387793
City
Durham
State
NC
Country
United States
Zip Code
27705
Zhou, Quan; Guan, Yongtao (2018) On the Null Distribution of Bayes Factors in Linear Regression. J Am Stat Assoc 113:1362-1371
Zhou, Quan; Zhao, Liang; Guan, Yongtao (2016) Strong Selection at MHC in Mexicans since Admixture. PLoS Genet 12:e1005847
Qi, Hongjian; Dong, Chengliang; Chung, Wendy K et al. (2016) Deep Genetic Connection Between Cancer and Developmental Disorders. Hum Mutat 37:1042-50