Genome-wide association studies have been incredibly successful at identifying novel genes and pathways associated with a wide array of complex diseases. However, despite the formation of large consortia to perform meta-analyses across cohorts, only a small fraction of the expected heritability of most common, complex diseases has been explained. The human genetics community is now adopting large-scale sequencing approaches (e.g., exome and whole genome) to identify rare variants that potentially have larger phenotypic effects. In response, statistical geneticists have created a litany of tests geared toward associating rare variants with disease. We hypothesize that the most parsimonious explanation for an inverse relationship between the frequency of causal alleles and their effect size is that many diseases are caused by an influx of newly arising deleterious mutations that are continually removed from the population by natural selection. We therefore propose to develop simulation software that will integrate what we know about how allele frequencies change over time from the theory-rich field of population genetics into the data-rich field of human genetics. Our resulting software will be used to develop strategies for sequencing global cohorts with high discovery power, and to aid in the evaluation of existing and future statistical tests. To achieve broad impact, we will create a graphical user interface (GUI) that produces effective figures, and apply our tool to compare and contrast a wide variety of existing statistical tests. We will then revamp our population genetic simulator to become the first population genetic simulator based on the heterogeneous computing architecture of both CPUs and graphics processing units (GPUs). Through intensive parallelization, our software will achieve disruptive efficiency. Using this approach, we will develop a platform for simulation-based inference that can accommodate complex evolutionary models. We will apply this approach to analyze forthcoming whole genome sequencing data from humans and Drosophila. Finally, we aim to return cutting-edge research to the classroom by developing simulation-based teaching tools. Our teaching tool will take the form of a GUI that enables hands-on learning of complex concepts.

Public Health Relevance

The next phase of genome-wide association studies (GWAS) will require whole genome resequencing. Making sense of this onslaught of data with the numerous tools currently being developed requires accurate simulation tools. We propose to continue developing and maintaining our population genetic simulator so that it becomes a driving force for designing high-powered sequencing-based association studies and inferring complex evolutionary models, and to bring research back to the classroom in the form of teaching tools.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Brooks, Lisa
University of California San Francisco
Schools of Pharmacy
San Francisco
United States
Szpiech, Zachary A; Hernandez, Ryan D (2014) selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol 31:2824-7