The Human Genome Project and subsequent projects such as 1000 Genomes, the Genome Sequencing Program (GSP), and Trans-Omics for Precision Medicine (TOPMed) are providing powerful resources for studying the genetic basis of human diseases. Combining these resources and technologies with the development of new statistical and computational methods has, in the last decade, led to the identification of thousands of loci associated with disease-related phenotypes, primarily through array-based genome-wide association studies (GWAS) empowered by genotype imputation from sequence-based haplotype panels. However, serious problems remain when analyzing these data: (1) Because short-read sequencing provides only unphased genotypes, statistical phasing methods are used to enable advanced analyses and to generate reference haplotypes for genotype imputation. Current methods for phasing sequence data, however, produce several thousand switch errors per genome. These phasing errors in turn limit the accuracy of genotype imputation and hamper our ability to study haplotype-aware disease models such as compound heterozygotes. (2) Because rare variants are so abundant, high-interest variants must be identified in order to obtain powerful test statistics. Within exons, the genetic code provides some of the necessary information, but for most of the genome we have very little information that allows us to prioritize variants. (3) Although samples sequenced from diverse and admixed populations are becoming more common, few methods are designed to make use of the unique properties of such data. For example, the distribution of local ancestry in admixed samples generates a unique haplotype structure that can be informative about the underlying phase.
Here we propose a set of novel methods to address these challenges. They build on the observation that in very large datasets most sequences will share a recent common ancestor with at least one other sequence, and that these closely related sequences will share long segments (>1 cM) identical by descent (IBD). These IBD segments provide information about the phase of the underlying variants, much as large sibships do. Moreover, the length of an IBD segment provides information about the age of the variants located on it. As young variants are more likely to be under selection, IBD length can be used to prioritize functional noncoding variants. We also aim to leverage the long-distance correlation of genotypes in admixed samples to identify phasing errors. Because a phasing error also changes the local ancestry of each haplotype in regions where an individual's ancestry is heterozygous, identifying these breaks makes it possible to detect and correct phasing errors. We will develop statistical models that build on these ideas and implement them in algorithms efficient enough to be applied to sample sizes >100,000. We will use our algorithms to annotate and re-phase existing large sequencing datasets and thus improve commonly used imputation reference panels. All software developed in this proposal will be publicly released in user-friendly, well-documented packages.
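The local-ancestry idea can be illustrated with a minimal sketch. It is not the proposed statistical model. It assumes that per-marker local ancestry labels for the two haplotypes of one individual are already available from some upstream caller, and that a switch error shows up as both haplotypes exchanging ancestry labels at the same marker. The function names `candidate_switch_points` and `correct_switches` and the labels "AFR" and "EUR" are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def candidate_switch_points(anc_a: Sequence[str], anc_b: Sequence[str]) -> List[int]:
    """Return marker indices i where the two haplotypes exchange ancestry
    labels between marker i-1 and marker i (assumed signature of a switch error)."""
    if len(anc_a) != len(anc_b):
        raise ValueError("ancestry tracks must have the same length")
    points = []
    for i in range(1, len(anc_a)):
        prev_a, prev_b = anc_a[i - 1], anc_b[i - 1]
        # Only informative where ancestry is heterozygous before the break.
        if prev_a != prev_b and anc_a[i] == prev_b and anc_b[i] == prev_a:
            points.append(i)
    return points


def correct_switches(hap_a: Sequence, hap_b: Sequence,
                     points: Sequence[int]) -> Tuple[list, list]:
    """Undo each candidate switch by exchanging the haplotype tails from that
    marker onward. Two consecutive switches therefore flip only the segment
    between them."""
    a, b = list(hap_a), list(hap_b)
    for p in points:
        a[p:], b[p:] = b[p:], a[p:]
    return a, b


# Toy example: two simultaneous label exchanges, at markers 2 and 4.
anc_a = ["AFR", "AFR", "EUR", "EUR", "AFR"]
anc_b = ["EUR", "EUR", "AFR", "AFR", "EUR"]
points = candidate_switch_points(anc_a, anc_b)
fixed_a, fixed_b = correct_switches(anc_a, anc_b, points)
```

A true ancestry change caused by recombination affects only one haplotype at a given marker, so it is not flagged by this rule. A practical method would additionally have to model uncertainty in the ancestry calls and cannot detect switch errors in regions where both haplotypes carry the same ancestry.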
Few existing methods leverage the abundant sequencing data generated by the rapidly accelerating throughput of sequencing technologies. Here we develop methods that analyze long-range haplotypes shared in large samples to improve phasing and to identify functional variants from signals of selection. We will provide scalable and user-friendly implementations of all developed tools.