Hundreds of thousands of human genomes are being sequenced, enabled by a profound reduction in the cost of sequencing. These data offer unprecedented opportunities to ascertain how the human genome varies and how this variation shapes human biology. While fine-scale sequence variation is today readily recognized by mature analysis methods, larger-scale forms of genome variation, especially those with many structurally distinct alleles, are challenging to recognize, analyze, and incorporate into association analyses. We seek to understand how human genomes vary at these scales and how this variation contributes to human phenotypes. We believe that it is possible to ascertain far more genetic variation in genome sequence data than is visible with analysis methods today. There is vast under-utilized information in the statistical patterns that large collections of sequence reads form across individuals, families, and populations, and in the haplotypes that multi-allelic variants form together with SNPs and other variants. Our focus in this work will be on two large, intriguing classes of genome variation that we seek to incorporate into routine genome analysis. One class involves multi-allelic CNVs, in which a genomic segment (from one to several hundred kilobases in size) exists in a wide range of copy numbers (such as 2–10) per diploid human genome, often varying in fine-scale sequence as well as copy number. Another class involves higher-copy-number variable-number-of-tandem-repeat (VNTR) polymorphisms, in which a shorter genomic sequence (tens to thousands of base pairs) exists in a wider range of copy numbers (up to scores or even hundreds of copies) per diploid genome. We will advance analysis methods that make it possible to measure sequence variation at these loci, identify the structural alleles from which this variation arises, and analyze the relationships of such variation to human phenotypes.
We will create and distribute research software and data resources, such as reference haplotypes, that enable human geneticists to incorporate such loci into association and fine-mapping analyses. We will also assess the contribution of these kinds of variation to quantitative phenotypes that are being collected in large population cohorts. We hope that this work contributes to many discoveries about the genetic and biological basis of disease.
Genetic variation that associates with disease can offer powerful clues about the biological basis of disease. Human genomes vary at scales large and small; while fine-scale DNA variation is readily recognized by today's analysis methods, many kinds of complex, larger-scale variation have been challenging to recognize and analyze. The goal of our work is to develop new and powerful ways to use emerging genome-sequencing data to understand how human genomes vary at these larger scales, and to identify which of these variations underlie risk of disease.