Population-based studies identifying common genetic variants that affect complex human diseases have relied heavily on population-genetic principles in important tasks such as study design, quality control, and genotype imputation. As the emphasis of mapping studies has now shifted to investigating rare variants in next-generation sequencing projects, new opportunities exist for leveraging population genetics to maximize the return from these investigations. Because studies thus far have often focused on populations of European descent, it is critical that new methods provide tools to analyze data from a greater diversity of populations. This project builds on productive efforts in the first funding period, proposing methods that capitalize on the study of human population genetics to enhance the design, analysis, and interpretation of genome sequencing studies, and focusing on analysis of rare risk variants in diverse human populations. (1) We will devise methods for selecting subsamples of individuals for genome and exome sequencing, particularly in admixed and structured populations. Such subsamples will make it possible for researchers to maximize their potential for achieving statistical power to detect rare disease variants. (2) We will enhance variant-calling accuracy, particularly in low-coverage data and for challenging indels and copy-number variants, by including in the variant-calling pipeline evidence accumulated from closely related haplotypes in the population. This approach will be particularly beneficial in admixed and genetically diverse populations, in which haplotype variation is especially significant and selecting an informative haplotype subset to assist in variant-calling is of greatest value. (3) We will use population-genetic principles to improve sample quality control in sequencing studies. First, we address the common challenge of sample contamination, which adversely affects variant-calling and downstream analyses. We will produce a method to estimate the genotypes of the minor contributor of a mixed sample, thus enabling the population of origin of a contaminating signal to be identified. This identification further facilitates variant-calling and permits in silico deconvolution of mixed samples. Second, to enhance the sharing of samples in large projects, we will devise methods to uncover duplicate or related samples from non-overlapping marker sets. Our approach will reduce the risk of expending effort to obtain sequence that will not be fully utilized, and will also assist in making use of historical low-density data in understudied populations. (4) We will incorporate new advances in the study of human population growth and natural selection for evaluating rare-variant tests and identifying powerful testing strategies. Evaluations of current tools often ignore important population-genetic factors such as selection or accelerating growth; our methods will enhance models for analyzing rare-variant testing methods, tailoring them to populations of interest. Throughout the project, we will use multi-population genome sequence data from the TOPMed and InPSYght studies to test our approaches. To facilitate use of our methods, we will produce, test, and distribute new publicly available software programs.

Public Health Relevance

Population-based studies that assess large samples of unrelated cases and controls offer a powerful approach to identify risk variants for common complex diseases. However, many methods for addressing the current focus of these studies on rare risk variants and genome sequencing make limited use of informative models from population genetics, and they often do not consider complexities inherent to studies of populations of non-European origin. Our project will leverage models from population genetics to provide methods and software that will accelerate the discovery of genetic factors that increase disease risk, addressing challenges arising from consideration of rare genetic variation, large sample sizes, complex sequencing projects, and the effort to find disease variants in underrepresented populations.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
Stanford University
Schools of Arts and Sciences
United States
Aw, Alan J; Rosenberg, Noah A (2018) Bounding measures of genetic similarity and diversity using majorization. J Math Biol 77:711-737
Reppell, M; Zöllner, S (2018) An efficient algorithm for generating the internal branches of a Kingman coalescent. Theor Popul Biol 122:57-66
Kim, Jaehee; Edge, Michael D; Algee-Hewitt, Bridget F B et al. (2018) Statistical Detection of Relatives Typed with Disjoint Forensic and Biomedical Loci. Cell 175:848-858.e6
Arbisser, Ilana M; Jewett, Ethan M; Rosenberg, Noah A (2018) On the joint distribution of tree height and tree length under the coalescent. Theor Popul Biol 122:46-56
Edge, Michael D; Algee-Hewitt, Bridget F B; Pemberton, Trevor J et al. (2017) Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets. Proc Natl Acad Sci U S A 114:5671-5676
Vattathil, Selina; Scheet, Paul (2016) Extensive Hidden Genomic Mosaicism Revealed in Normal Tissue. Am J Hum Genet 98:571-578
Kang, Jonathan T L; Goldberg, Amy; Edge, Michael D et al. (2016) Consanguinity Rates Predict Long Runs of Homozygosity in Jewish Populations. Hum Hered 82:87-102
Kang, Jonathan T L; Zhang, Peng; Zöllner, Sebastian et al. (2015) Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf. Genetics 201:499-511
Lo, Yancy; Kang, Hyun M; Nelson, Matthew R et al. (2015) Comparing variant calling algorithms for target-exon sequencing in a large sample. BMC Bioinformatics 16:75
Buzbas, Erkan O; Rosenberg, Noah A (2015) AABC: approximate approximate Bayesian computation for inference in population-genetic models. Theor Popul Biol 99:31-42

Showing the most recent 10 out of 43 publications