Phasing, defined as the estimation of haplotypes from diploid genotype data, is a fundamental problem in medical and population genetics. Phasing is a key preprocessing step for the genotype imputation algorithms employed in genome-wide association studies of diseases and complex traits, and is also important for mapping molecular QTL using allele-specific reads, detecting clonal mosaicism, inferring population structure, and detecting natural selection. Considerable resources have been invested in developing accurate phasing algorithms, but unsolved challenges remain, including: (i) incorporating large reference panels, such as the Haplotype Reference Consortium, to improve phasing accuracy (reference-based phasing), and (ii) phasing extremely large cohorts using within-cohort data (cohort-based phasing). Here, we propose an exploratory two-year research program in which we will develop methods and software for both reference-based and cohort-based phasing, using a new data structure based on the Positional Burrows-Wheeler Transform (PBWT).
We aim to make fast and accurate phasing methods and software freely available to all researchers via public phasing servers. We will also explore the early, conceptual stages of developing PBWT-based methods for reference-based imputation. Our team has multiple strengths: our statistical and computational expertise; our track record of producing practical, publicly available software packages that are widely used by the community for a broad range of applications in statistical genetics; and our data-driven approach to methods research. We will guide our methods development using data from 500,000 samples from the UK Biobank, and will work closely with the Haplotype Reference Consortium (see letters of support).

Public Health Relevance

Statistical phasing, defined as the use of statistical methods to partition an individual's genome into its maternal and paternal components, is a problem of fundamental importance in medical genetics. Association studies, which link genetic variants to disease, use statistical phasing to produce a more complete and accurate catalog of the genetic variants that each individual in the study carries. In this proposal, we will develop new methods for statistical phasing in very large data sets that are faster and more accurate than previous methods, helping association studies to succeed.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Exploratory/Developmental Grants (R21)
Project #: 1R21HG009513-01
Application #: 9293785
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Brooks, Lisa
Project Start: 2017-06-03
Project End: 2019-05-31
Budget Start: 2017-06-03
Budget End: 2018-05-31
Support Year: 1
Fiscal Year: 2017
Total Cost: $238,438
Indirect Cost: $88,438

Name: Harvard University
Department: Public Health & Prev Medicine
Type: Schools of Public Health
DUNS #: 149617367
City: Boston
State: MA
Country: United States
Zip Code: 02115

Loh, Po-Ru; Genovese, Giulio; Handsaker, Robert E et al. (2018) Insights into clonal haematopoiesis from 8,342 mosaic chromosomal alterations. Nature 559:350-355