This project will develop a series of computational tools that exploit the power of haplotype-based models for the analysis of population genomics data. The development of such tools is particularly important as advances in sequencing have now made it routine for sequence data to be gathered across full chromosomes. The multi-locus patterns of linkage disequilibrium that are present in haplotype data are informative about a range of important processes in population genetics. Leveraging the information in haplotypes is methodologically challenging, and for many specific problems the appropriate analysis tools do not yet exist. In response, our research will develop haplotype-based models in four major directions. First, we will develop haplotype-based models to infer recombination rates using genetic data from admixed individuals. The key principle is that ancestry switch points in admixed individuals can be used to infer recent recombination events. Our work will produce a software package for inference of recombination rates based on genome-wide single-nucleotide polymorphism data, and a separate simulation package for generating data with which to test the method. A key innovation will be developing and testing a version of this approach that can handle multi-way admixtures (more than two source populations). Second, we will use haplotype-informed approaches to improve the power of complex trait mapping approaches based on the "evolve and resequence" paradigm. The improvement in power will come from using haplotype information embedded in the raw read data from pooled sequencing experiments. Again, we will develop both inference software and simulations to test the inference methods. Third, we will investigate to what extent purifying selection has shaped haplotype diversity in human populations. The expectation is that segregating deleterious variants will show reduced haplotype diversity, much as adaptive variants do.
This signature has been largely unexplored, and we will develop theoretical, empirical, and simulation-based approaches to establish whether this property exists and how it can be used to infer the strength of purifying selection in human population genetic data. Finally, we will derive a novel form of the conditional sampling distribution (CSD) for a haplotype. The application of CSDs in population genetics has been very fruitful, even though the approach is in its infancy. We will develop an approach that leads to a more accurate CSD. The new CSD will also open the door to extensions for computing haplotype probabilities in models with non-equilibrium demography and/or population structure. Throughout the project there will be an emphasis on software development for the broader population genomics community, and on overcoming computational and algorithmic challenges that arise commonly with haplotype-based models. These contributions are essential for advancing population genetics in the genomic era.

Project Relevance

This project will contribute to the basic toolkit population geneticists use to extract information from large genomic datasets and will enhance research in a number of applied areas with practical relevance. In particular, we will develop tools that empower researchers to measure recombination, map complex traits, and understand the fitness consequences of human genetic variation. These areas are relevant to disease trait mapping, genetic disease etiology, and historical demography. Finally, we expect the algorithms developed will be useful, either directly or with minor adjustment, for closely related problems beyond those detailed in the project. As an example, our algorithms for haplotype frequency estimation in pooled sequences are closely related to the problem of estimating the abundance of pathogenic strains from sequencing of blood DNA.
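To make the pooled-sequencing idea concrete, the following is a minimal sketch of haplotype frequency estimation by expectation-maximization, assuming the candidate haplotypes are known in advance and each read reports alleles at one or more SNPs. It is an illustration only and not the project's software: the function names, the read representation (a dict from site index to allele), and the fixed per-site error rate `err` are all assumptions introduced here.

```python
from typing import Dict, List, Sequence


def read_likelihood(read: Dict[int, int], hap: Sequence[int], err: float) -> float:
    """P(read | haplotype), assuming independent per-site errors (hypothetical model)."""
    lik = 1.0
    for site, allele in read.items():
        lik *= (1.0 - err) if hap[site] == allele else err
    return lik


def em_haplotype_freqs(
    reads: List[Dict[int, int]],
    haps: List[Sequence[int]],
    err: float = 0.01,
    n_iter: int = 500,
    tol: float = 1e-10,
) -> List[float]:
    """Estimate frequencies of known haplotypes in a pool with a mixture-model EM."""
    k = len(haps)
    freqs = [1.0 / k] * k  # start from uniform frequencies
    # Read-by-haplotype likelihoods do not change between iterations.
    liks = [[read_likelihood(r, h, err) for h in haps] for r in reads]
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in liks:
            # E-step: posterior probability that each haplotype produced this read.
            weights = [f * l for f, l in zip(freqs, row)]
            s = sum(weights)
            for j in range(k):
                totals[j] += weights[j] / s
        # M-step: new frequency is the average posterior weight per haplotype.
        new_freqs = [t / len(reads) for t in totals]
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Toy pool: two candidate haplotypes over three SNPs, four short reads.
haps = [[0, 0, 0], [1, 1, 1]]
reads = [{0: 0, 1: 0}, {0: 0, 1: 0}, {1: 0, 2: 0}, {1: 1, 2: 1}]
print(em_haplotype_freqs(reads, haps))
```

Because reads that span several SNPs are matched against whole haplotypes rather than single sites, the estimate uses the linkage information in the raw reads. The same mixture structure applies if the "haplotypes" are pathogen strains and the frequencies are strain abundances.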
The proposed research will develop a series of computational tools that exploit the power of explicit haplotype-based models for the analysis of population genomics data. The applications of these tools will empower efforts to (1) estimate recombination in admixed populations, (2) map the genetic basis of complex traits using the evolve and resequence paradigm, (3) quantify purifying selection in human populations, and (4) improve basic models of haplotype variation.
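As a rough illustration of the first application, the sketch below counts ancestry switch points between adjacent markers across a set of admixed haplotypes and converts the counts into relative per-base-pair rates. It assumes local ancestry has already been called without error at every marker, which real data will not satisfy. The function names and toy inputs are hypothetical, and this is not the inference method the project proposes. Because the code only compares ancestry labels, it works unchanged with more than two source populations.

```python
from typing import List, Sequence


def count_switches(ancestry: Sequence[Sequence[int]]) -> List[int]:
    """Count ancestry switches in each interval between adjacent markers."""
    n_markers = len(ancestry[0])
    counts = [0] * (n_markers - 1)
    for hap in ancestry:
        if len(hap) != n_markers:
            raise ValueError("all haplotypes must cover the same markers")
        for i in range(n_markers - 1):
            # A change of ancestry label marks a recombination since admixture.
            if hap[i] != hap[i + 1]:
                counts[i] += 1
    return counts


def relative_rates(counts: Sequence[int], positions: Sequence[int]) -> List[float]:
    """Share of all observed switches in each interval, per base pair."""
    total = sum(counts)
    rates = []
    for i, c in enumerate(counts):
        length = positions[i + 1] - positions[i]
        rates.append((c / total) / length if total else 0.0)
    return rates


# Toy data: four haplotypes, four markers, ancestry labels 0 or 1.
ancestry = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
]
positions = [1000, 2000, 3000, 5000]  # marker coordinates in base pairs

counts = count_switches(ancestry)
print(counts)
print(relative_rates(counts, positions))
```

The rates are relative rather than absolute: turning switch counts into recombination rates per generation would also require the time since admixture and a model of uncertainty in the ancestry calls, neither of which is attempted here.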