Massively parallel ("next-generation") shotgun DNA sequencing projects will provide the highest-resolution view to date of genetic variation in human populations. This new technology offers great promise for interrogating the genetic etiology of complex disease. However, with this promise come challenges. These new sequencing methods are prone to nontrivial error rates and sparse coverage of mapped reads, confounding polymorphism discovery and genotyping. Copy number variation must often be inferred indirectly. The massive size of these data sets requires rapid and scalable analytic approaches. In this proposal, we present statistical methods to address these challenges directly, using computationally tractable models for population genetic variation. Our methods take account of the dependence among nearby alleles (linkage disequilibrium) with a cluster-based model for haplotype variation, and use this information to aid inferences about the underlying genetic architecture of the samples. Specifically, we propose to call genotypes and detect novel polymorphic loci from next-generation shotgun sequence data, detect rare disease risk alleles for follow-up sequencing studies, and simultaneously model single nucleotide and copy number polymorphism in population data to facilitate studies of association between phenotype and genotype. Our experienced team of medical and statistical geneticists has the technical expertise and access to data sets necessary for achieving these aims. We will implement our methods in our widely used software package fastPHASE.

Public Health Relevance

High-throughput DNA sequencing technology is providing unparalleled detail of human genetic variation. This will allow finer resolution in locating genes that affect human health and disease. Both the large quantity and the uneven quality of the data from this new technology demand new statistical methods for inference, risk assessment, and eventually clinical translation.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG005859-04
Application #: 8686916
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa
Project Start: 2011-09-01
Project End: 2016-05-31
Budget Start: 2014-06-01
Budget End: 2015-05-31
Support Year: 4
Fiscal Year: 2014
Total Cost: $376,477
Indirect Cost: $111,293

Name: University of Texas MD Anderson Cancer Center
Department: Public Health & Prev Medicine
Type: Schools of Medicine
DUNS #: 800772139
City: Houston
State: TX
Country: United States
Zip Code: 77030

Wang, Gao T; Li, Biao; Santos-Cortez, Regie P Lyn et al. (2014) Power analysis and sample size estimation for sequence-based association studies. Bioinformatics 30:2377-8
Wang, Gao T; Peng, Bo; Leal, Suzanne M (2014) Variant association tools for quality control and analysis of large-scale sequence and genotyping array data. Am J Hum Genet 94:770-83
Xia, Rui; Vattathil, Selina; Scheet, Paul (2014) Identification of allelic imbalance with a statistical model for subtle genomic mosaicism. PLoS Comput Biol 10:e1003765
Xu, Hanli; Guan, Yongtao (2014) Detecting local haplotype sharing and haplotype association. Genetics 197:823-38
Vattathil, Selina; Scheet, Paul (2013) Haplotype-based profiling of subtle allelic imbalance with SNP arrays. Genome Res 23:152-8
San Lucas, F Anthony; Wang, Gao; Scheet, Paul et al. (2012) Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. Bioinformatics 28:421-2