The shift in attention toward rare alleles, and the concomitant need for DNA sequence data from large samples drawn from human populations, have driven the need to accurately describe the patterns of human DNA sequence variation and to understand the forces that impact it. At the same time, SNP genotyping platforms are expanding in SNP density, and unprecedented sample sizes are accumulating from GWAS. In order to foster rigorous inferences about human variation and past human evolution from these data, we propose a series of investigations that center around four aims. First, we will develop novel statistical methods for population genetic inference from next-generation DNA sequence data. Starting from alignments of sequence reads from multiple individuals, we will develop methods of parameter estimation and hypothesis testing that integrate over likelihoods of genotypes conditional on the data. The optimal balance between sample size and sequencing coverage will be analyzed for several distinct experimental problems. The methods will be thoroughly tested and applied to several resequencing data sets to which we have access. Second, we will develop and extend methods for ancestry inference from SNP and genome sequence data of admixed individuals and employ them to infer past demographic history, including migration. Our method of ancestry inference based on Principal Components Analysis will be extended to accommodate data uncertainty and ascertainment bias. We will model a range of admixture scenarios, from single-pulse to continuous influx, in order to determine whether genetic data allow more refined inference of the past history of mixing of two ancestral populations. Third, we will develop methods for estimation of joint IBD relationships across multiple individuals. Existing methods take discrete genotype calls as a starting point and do not accommodate platform-specific error. There is a significant need to develop methods for inference of shared IBD regions genome-wide across multiple individuals in large population samples. Through a combination of heuristic approaches and graph-theory-based computational algorithms, we will develop and test such methods. Finally, we will use IBD sharing inferred across individuals in a sample to estimate population genetic parameters in models of demography and selection. Just as demographic changes impact the site frequency spectrum of SNPs, so too will they impact the pattern of IBD sharing in a sample. Turning this problem around, we will develop approaches for inference of population genetic parameters, such as demography, rates of inbreeding, levels of purifying and positive selection, admixture and migration, based only on the patterns of IBD sharing. These will be contrasted with approaches that use phased haplotype information for demography inference.
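
To illustrate what "integrating over likelihoods of genotypes conditional on the data" can look like in practice, the following is a minimal sketch, not the project's actual method. It estimates the alternate-allele frequency at a single biallelic site directly from per-individual read counts, summing over the three possible genotypes with an EM algorithm instead of calling genotypes first. The function names, the symmetric per-read error rate `err`, and the Hardy-Weinberg genotype prior are all assumptions made for this illustration.

```python
def genotype_likelihoods(n_ref, n_alt, err=0.01):
    """Likelihood of the read counts for genotypes carrying 0, 1, 2 alt alleles.

    Assumes independent reads and a symmetric per-read error rate `err`
    (both are simplifying assumptions). The binomial coefficient is omitted
    because it is the same for all three genotypes and cancels out.
    """
    liks = []
    for g in (0, 1, 2):
        # Probability that a single read shows the alt allele given genotype g.
        p_alt = (g / 2) * (1 - err) + (1 - g / 2) * err
        liks.append(p_alt ** n_alt * (1 - p_alt) ** n_ref)
    return liks


def estimate_allele_frequency(read_counts, err=0.01, n_iter=100, tol=1e-8):
    """EM estimate of the alt allele frequency from (n_ref, n_alt) read counts.

    Genotypes are never called; each iteration weights the three genotypes
    by their posterior probability under a Hardy-Weinberg prior (assumed).
    """
    liks = [genotype_likelihoods(r, a, err) for r, a in read_counts]
    f = 0.5  # arbitrary starting value
    for _ in range(n_iter):
        expected_alt = 0.0
        for l0, l1, l2 in liks:
            # Unnormalized genotype posteriors given the current frequency f.
            w0 = l0 * (1 - f) ** 2
            w1 = l1 * 2 * f * (1 - f)
            w2 = l2 * f ** 2
            # Expected number of alt alleles carried by this individual.
            expected_alt += (w1 + 2 * w2) / (w0 + w1 + w2)
        f_new = expected_alt / (2 * len(liks))
        converged = abs(f_new - f) < tol
        f = f_new
        if converged:
            break
    return f


# Four individuals at high coverage: two homozygous reference, one
# heterozygote, one homozygous alternate (3 alt alleles out of 8).
high_cov = [(20, 0), (10, 10), (0, 20), (20, 0)]
f_hat = estimate_allele_frequency(high_cov)
```

At high coverage the genotype posteriors are nearly certain and the estimate approaches the simple allele count. At low coverage the same code spreads weight across genotypes, which is the situation where the trade-off between sample size and sequencing coverage becomes relevant.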

Public Health Relevance

This project aims to understand the population-level forces at play on the human genome by analysis of genome-wide SNP data and next-generation sequences using newly developed statistical methods. Estimation of model parameters from alignments of next-generation sequence reads will be done so as to accommodate base-calling uncertainty, and segment-wise inference of ancestry in admixed genomes will be applied to understand past admixture history. Identity-by-descent methods will be pursued to allow the most reliable inferences about demography, natural selection and other population forces acting on human genetic variation.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
2R01HG003229-07
Application #
8187897
Study Section
Genetic Variation and Evolution Study Section (GVE)
Program Officer
Brooks, Lisa
Project Start
2004-05-21
Project End
2014-06-30
Budget Start
2011-09-14
Budget End
2012-06-30
Support Year
7
Fiscal Year
2011
Total Cost
$784,993
Indirect Cost
Name
Cornell University
Department
Biochemistry
Type
Schools of Earth Sciences/Natur
DUNS #
872612445
City
Ithaca
State
NY
Country
United States
Zip Code
14850
Racimo, Fernando; Gokhman, David; Fumagalli, Matteo et al. (2017) Archaic Adaptive Introgression in TBX15/WARS2. Mol Biol Evol 34:509-524
Racimo, Fernando; Marnetto, Davide; Huerta-Sánchez, Emilia (2017) Signatures of Archaic Adaptive Introgression in Present-Day Human Populations. Mol Biol Evol 34:296-317
Henn, Brenna M; Botigué, Laura R; Peischl, Stephan et al. (2016) Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc Natl Acad Sci U S A 113:E440-9
Poznik, G David; Xue, Yali; Mendez, Fernando L et al. (2016) Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet 48:593-9
Slavney, Andrea; Arbiza, Leonardo; Clark, Andrew G et al. (2016) Strong Constraint on Human Genes Escaping X-Inactivation Is Modulated by their Expression Level and Breadth in Both Sexes. Mol Biol Evol 33:384-93
Racimo, Fernando; Sankararaman, Sriram; Nielsen, Rasmus et al. (2015) Evidence for archaic adaptive introgression in humans. Nat Rev Genet 16:359-71
Hunter-Zinck, Haley; Clark, Andrew G (2015) Aberrant Time to Most Recent Common Ancestor as a Signature of Natural Selection. Mol Biol Evol 32:2784-97
Ma, Li; Keinan, Alon; Clark, Andrew G (2015) Biological knowledge-driven analysis of epistasis in human GWAS with application to lipid traits. Methods Mol Biol 1253:35-45
Rohlfs, Rori V; Aguiar, Vitor R C; Lohmueller, Kirk E et al. (2015) Fitting the Balding-Nichols model to forensic databases. Forensic Sci Int Genet 19:86-91
Henn, Brenna M; Botigué, Laura R; Bustamante, Carlos D et al. (2015) Estimating the mutation load in human genomes. Nat Rev Genet 16:333-43

Showing the most recent 10 out of 103 publications