Phasing, defined as the estimation of haplotypes from diploid genotype data, is a fundamental problem in medical and population genetics. Phasing is a key preprocessing step for genotype imputation algorithms employed in genome-wide association studies of diseases and complex traits, and is also important for mapping molecular QTL using allele-specific reads, detecting clonal mosaicism, inferring population structure, and detecting natural selection. Considerable resources have been invested in developing accurate phasing algorithms, but unsolved challenges remain, including: (i) incorporating large reference panels, such as the Haplotype Reference Consortium, to improve phasing accuracy (reference-based phasing), and (ii) phasing extremely large cohorts using within-cohort data (cohort-based phasing). Here, we propose an exploratory two-year research program in which we will develop methods and software for both reference-based and cohort-based phasing, using a new data structure based on the Positional Burrows-Wheeler Transform (PBWT).
We aim to make fast and accurate phasing methods and software freely available to all researchers via public phasing servers. We will also explore the early, conceptual stages of developing PBWT-based methods for reference-based imputation. Our team has multiple strengths: our statistical and computational expertise; our track record of producing practical, publicly available software packages that are widely used by the community for a broad range of applications in statistical genetics; and our data-driven approach to methods research. We will guide our methods development using data from 500,000 samples from the UK Biobank, and will work closely with the Haplotype Reference Consortium (see letters of support).
Statistical phasing, defined as the use of statistical methods to partition an individual's genome into its maternal and paternal components, is a problem of fundamental importance in medical genetics. Studies that associate genetic variants with disease rely on statistical phasing to produce a more complete and accurate catalog of the genetic variants that each study participant carries. In this proposal, we will develop new statistical phasing methods for very large data sets that are faster and more accurate than previous methods, helping association studies to succeed.
Loh, Po-Ru; Genovese, Giulio; Handsaker, Robert E et al. (2018) Insights into clonal haematopoiesis from 8,342 mosaic chromosomal alterations. Nature 559:350-355