The ability to generate sequence data is rapidly becoming a reality. Sequencing efforts are already underway at candidate gene regions surrounding association peaks identified by genome-wide association studies (GWAS), paving the way for "whole-exome" and, ultimately, whole-genome sequencing studies. Comprehensive sequencing has the potential to reveal a vast trove of low-frequency variants, but most statistical association methods used for GWAS are likely inadequate: they target common variants and are optimized for testing one variant at a time, and therefore do not account for multiple variants acting at the same locus. For sequencing studies to attain their full potential, the development of new statistical methods will be critical. We propose to develop new methods for both targeted and genome-wide sequencing approaches.
In Specific Aim 1 we will develop statistical methods for identifying causal variants inside a targeted region, such as a GWAS peak or candidate gene. DNA sequencing provides a complete picture of genetic variation, enabling the localization of association signal(s) in order to identify true causal alleles against a background of variants correlated through linkage disequilibrium. We will design statistical strategies for finding causal variants underlying association peaks, allowing for the presence of multiple causal alleles at a locus.
In Specific Aim 2 we will develop statistical methods for sequencing studies to optimally capture the association signal arising from multiple rare variants acting within the same disease gene. The initial focus will be on candidate gene sequencing with an eye towards whole-exome and even whole-genome sequencing. Associations of individual rare alleles with disease are difficult to detect because single-variant association tests have limited power at low allele frequencies. Therefore, we will develop methods that combine multiple rare variants from the same gene (or pathway) and treat genes (or pathways), rather than individual alleles, as the unit for the association test. Recent studies demonstrate that genes underlying certain quantitative phenotypes display an excess of rare coding variation in individuals at one phenotypic extreme. In addition to combining multiple rare variants in a single test, we will also develop methods incorporating both rare and common variants, which will be important when whole-genome sequencing eventually becomes practical.
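The idea of treating a gene, rather than a single allele, as the unit of the test can be illustrated with a minimal sketch. It is not the method proposed here, only one simple way to combine rare variants: minor-allele counts at rare sites are summed per individual, and the mean burden in cases is compared with that in controls by label permutation. The function names, the frequency threshold, and the choice of a permutation test are all assumptions made for illustration.

```python
import random


def burden_scores(genotypes, maf_threshold=0.01):
    """Sum minor-allele counts over rare variants for each individual.

    genotypes: list of per-individual lists, one 0/1/2 minor-allele count
    per variant in the gene. maf_threshold is an illustrative cutoff
    (assumption), not a value taken from the proposal.
    """
    n = len(genotypes)
    n_variants = len(genotypes[0])
    # Sample minor-allele frequency of each variant (2 alleles per individual).
    mafs = [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(n_variants)]
    # Keep variants that are observed at least once but are below the cutoff.
    rare = [j for j, f in enumerate(mafs) if 0 < f < maf_threshold]
    return [sum(ind[j] for j in rare) for ind in genotypes]


def permutation_burden_test(genotypes, is_case, maf_threshold=0.01,
                            n_perm=2000, seed=0):
    """Gene-level test: difference in mean rare-allele burden, cases vs controls.

    Returns (observed difference, two-sided permutation p-value).
    Assumes at least one case and one control.
    """
    scores = burden_scores(genotypes, maf_threshold)

    def mean_difference(labels):
        case = [s for s, c in zip(scores, labels) if c]
        ctrl = [s for s, c in zip(scores, labels) if not c]
        return sum(case) / len(case) - sum(ctrl) / len(ctrl)

    observed = mean_difference(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype association
        if abs(mean_difference(labels)) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

In this sketch no single rare variant needs to reach significance on its own; the signal comes from the aggregate count across the gene. A simple sum also assumes that all rare variants act in the same direction, which is one reason more refined weighting schemes are of interest.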
In Specific Aim 3 we will assess the power of both targeted and genome-wide approaches and generate study design recommendations, using a population genetic model based on allele frequency distributions from empirical sequencing data sets. We will make recommendations on sequencing strategies, sample sizes, and inclusion of specific populations. All of our power calculations and recommendations will critically depend on assumptions about allele frequency distributions, which we will rigorously model using empirical sequence data. Our population genetic model will incorporate complex demographic histories, recombination and natural selection in addition to mutation and genetic drift.
RESEARCH NARRATIVE
The study of human genetic variation has already begun to pay big dividends, as genome-wide association studies (GWAS) focusing on common genetic variation have identified risk variants for numerous complex diseases. However, for most diseases the fraction of genetic heritability explained by these findings is extremely small, motivating deep resequencing studies, which will be able to identify rare risk variants. These resequencing studies will require new statistical methods that will have great potential for furthering our understanding of disease etiology, leading to possible drug targets, and may also be useful for diagnostic testing in healthy individuals.