Whole-genome association testing by genotyping common SNPs shows promise for identifying genetic variants that cause elevated risk of complex disorders, but there is a pressing need to apply deep resequencing to the regions found by these studies to further understand the disease association. Full sequence data from a population sample will no longer suffer from ascertainment bias, and the full spectrum of population genetic models may be fitted to the data. A serious challenge remains, however, in identifying optimal means of establishing association between sequence variants and disease risk, and we will pursue four aims toward this goal.
Specific Aim 1 will consider the role of rare variants, starting with the challenge of calling singleton heterozygous sites in large samples and dealing with errors in these calls. Rare variants do not provide statistical power to be tested for phenotype associations individually, but a variety of tests of collective effects of rare variants are proposed. The basic idea of considering the collections of genotypic attributes that distinguish the phenotypic tails of the distribution of measured genotypes will be explored extensively.
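As one illustration of a collective test of rare variants, the sketch below collapses rare alleles into a per-individual burden score and compares cases with controls by permutation. It is a generic burden test and not necessarily one of the tests proposed here; the function names, the frequency threshold, and the one-sided permutation scheme are all assumptions made for the example.

```python
import random

def burden_scores(genotypes, maf_threshold=0.01):
    """Count rare minor alleles carried by each individual.

    genotypes: one list per individual of minor-allele counts (0, 1 or 2),
    with the same sites in the same order (hypothetical input layout).
    """
    n = len(genotypes)
    n_sites = len(genotypes[0])
    # Sample minor-allele frequency at each site.
    freqs = [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(n_sites)]
    # Keep only polymorphic sites at or below the frequency threshold.
    rare = [j for j, f in enumerate(freqs) if 0 < f <= maf_threshold]
    return [sum(ind[j] for j in rare) for ind in genotypes]

def permutation_pvalue(scores, is_case, n_perm=2000, seed=0):
    """One-sided permutation p-value for excess burden in cases."""
    rng = random.Random(seed)

    def mean_difference(labels):
        case = [s for s, c in zip(scores, labels) if c]
        ctrl = [s for s, c in zip(scores, labels) if not c]
        return sum(case) / len(case) - sum(ctrl) / len(ctrl)

    observed = mean_difference(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype link
        if mean_difference(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because singleton calls are error-prone, a score of this kind is only as reliable as the underlying genotype calls, which is why the aim begins with the calling problem.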
Specific Aim 2 will consider the case of dense resequencing of specific candidate genome regions in case-control and cohort studies. Both unphased multi-site genotype data and phased haplotypes inferred from the genotype data represent samples from a population that may or may not fit well-studied population genetic and demographic models. The causal model connecting phenotype to genotype likely includes an integration of effects of multiple SNPs, including possibly highly non-additive effects. We propose a likelihood ratio based method that calculates the probability of observing the phenotypic data under a specific genetic model that can account for the combined effect of multiple SNPs.
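To make the likelihood ratio idea concrete, the sketch below tests a deliberately simple non-additive model in which disease risk is elevated only for individuals carrying minor alleles at both of two SNPs. The model, the function names, and the use of the asymptotic chi-square distribution with one degree of freedom are assumptions for illustration; the method proposed in this aim is more general.

```python
import math

def bernoulli_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k cases among n individuals."""
    if n == 0 or k == 0 or k == n:
        return 0.0  # the fitted risk is 0 or 1, so the likelihood is 1
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def epistatic_lrt(genotypes, is_case, site_a, site_b):
    """Likelihood ratio test of a two-SNP interaction-only risk model.

    Null: one disease risk shared by everyone.
    Alternative: a separate risk for carriers of minor alleles at both sites.
    """
    exposed = [g[site_a] > 0 and g[site_b] > 0 for g in genotypes]
    n1 = sum(exposed)
    k1 = sum(1 for e, c in zip(exposed, is_case) if e and c)
    n0 = len(genotypes) - n1
    k0 = sum(1 for c in is_case if c) - k1
    ll_null = bernoulli_loglik(k0 + k1, n0 + n1)
    ll_alt = bernoulli_loglik(k0, n0) + bernoulli_loglik(k1, n1)
    stat = 2 * (ll_alt - ll_null)
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2))
    return stat, p_value
```

The same comparison of maximized likelihoods extends to models with more SNPs, at the cost of more parameters and therefore more degrees of freedom.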
Specific Aim 3 will extend the model of Aim 2 to a Bayesian setting, applying Markov Chain Monte Carlo techniques for sampling from the posterior distribution of effects. This will allow the test to be applied to much larger data sets, including resequencing of regions after genome-wide association studies.
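A minimal sketch of posterior sampling by Markov chain Monte Carlo follows, assuming a logistic model with an intercept `alpha`, a single genetic effect `beta`, and independent normal priors. The model, the prior scale, and the random-walk Metropolis proposal are assumptions chosen for brevity, not the sampler this aim will develop.

```python
import math
import random

def log_posterior(alpha, beta, x, y, prior_sd=2.0):
    """Unnormalized log posterior for logistic regression of y on x."""
    lp = -(alpha ** 2 + beta ** 2) / (2 * prior_sd ** 2)  # normal priors
    for xi, yi in zip(x, y):
        eta = alpha + beta * xi
        # Numerically stable log(1 + exp(eta)).
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        lp += yi * eta - softplus
    return lp

def metropolis(x, y, n_iter=5000, step=0.3, seed=1):
    """Random-walk Metropolis sampler returning (alpha, beta) draws."""
    rng = random.Random(seed)
    alpha, beta = 0.0, 0.0
    current = log_posterior(alpha, beta, x, y)
    samples = []
    for _ in range(n_iter):
        a_new = alpha + rng.gauss(0, step)
        b_new = beta + rng.gauss(0, step)
        proposed = log_posterior(a_new, b_new, x, y)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < proposed - current:
            alpha, beta, current = a_new, b_new, proposed
        samples.append((alpha, beta))
    return samples
```

Discarding an initial stretch of draws as burn-in and averaging `beta` over the remainder gives a posterior mean estimate of the effect on the log-odds scale.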
Specific Aim 4 will develop an improved and flexible Bayesian association mapping approach that can integrate disparate sources of data (such as age, intermediate phenotype, and environment) to estimate the inflation of disease risk for single genetic variants or combinations of variants, environmental conditions, age, or combinations of these factors.
For all four aims, approaches will be tested with sample resequencing data as well as genotypic data generated by simulation of the coalescent with recombination under realistic human demographic models. In both cases phenotypes will be specified by a variety of genetic models. Data from the Framingham Heart Study, from GAIN genome-wide studies, and from the Sanger Institute data on transcript abundance of HapMap samples will be used to test the methods.
DNA resequencing of large samples of individuals from case/control and cohort studies will yield information about associations with disease risk only if the properties of statistical models that describe the causal associations between genotype and phenotype are fully explored. This project is designed to develop an analytical framework that uses the underlying structure of the genetic data (based on population genetic principles) to provide the maximum statistical power for inference of association when many SNPs within a gene may each contribute a small, non-additive portion of the increased risk. These methods will also be extended to include prior information about molecular mechanisms of gene function, where available, as well as environmental contributions to disease risk.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Mental Health (NIMH)
Type
Research Project (R01)
Project #
5R01MH084695-03
Application #
7933841
Study Section
Special Emphasis Panel (ZMH1-ERB-C (06))
Program Officer
Bender, Patrick
Project Start
2008-09-28
Project End
2012-07-31
Budget Start
2010-08-01
Budget End
2012-07-31
Support Year
3
Fiscal Year
2010
Total Cost
$386,072
Indirect Cost
Name
Cornell University
Department
Biochemistry
Type
Schools of Arts and Sciences
DUNS #
872612445
City
Ithaca
State
NY
Country
United States
Zip Code
14850
Peter, Benjamin M; Huerta-Sanchez, Emilia; Nielsen, Rasmus (2012) Distinguishing between selective sweeps from standing variation and from a de novo mutation. PLoS Genet 8:e1003011
Chen, Rong; Corona, Erik; Sikora, Martin et al. (2012) Type 2 diabetes risk alleles demonstrate extreme directional differentiation among human populations, compared to other diseases. PLoS Genet 8:e1002621
Arguello, J Roman; Connallon, Tim (2011) Gene duplication and ectopic gene conversion in Drosophila. Genes (Basel) 2:131-51
Yi, Xin; Liang, Yu; Huerta-Sanchez, Emilia et al. (2010) Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329:75-8
Manolio, Teri A; Collins, Francis S; Cox, Nancy J et al. (2009) Finding the missing heritability of complex diseases. Nature 461:747-53
Pollard, Katherine S; Serre, David; Wang, Xu et al. (2008) A genome-wide approach to identifying novel-imprinted genes. Hum Genet 122:625-34
Clark, Andrew G; Boerwinkle, Eric; Hixson, James et al. (2005) Determinants of the success of whole-genome association testing. Genome Res 15:1463-7