Mutation and natural selection are fundamental forces of evolution, and their intensities across the genome are key factors in determining the genomic landscape of human genetic disease variation and evolution. The goal of this proposal is to construct a detailed map of mutation rates and purifying selection along the human genome using novel statistical methodologies. Existing approaches to estimating mutation rates and selection are often based on genome comparison across species, but for the purpose of studying human genetics and evolution, we believe estimates inferred from the human population are more relevant and increasingly feasible thanks to large-scale sequencing. Statistical methods for intra-human analysis, however, are in their infancy and face a number of challenges; for example, many factors affecting mutation rates are unknown, and complex human demographic changes complicate the inference of selection. We propose three specific aims: (1) Estimation of base-level mutation rates across the human genome. We will use de novo mutations from pedigree sequencing data to directly estimate germline mutation rates. Our model will incorporate a large set of genomic features potentially associated with mutation rates, including novel ones not utilized by earlier methods, such as DNA structure and epigenomic information in germline cells. Our statistical model also incorporates a random-effect component and captures spatial correlations of mutation rates between nearby regions at multiple scales. (2) Inference of purifying selection in the human genome. Existing methods for detecting intra-species constraint often rely on only one of multiple signatures of selection at a time (e.g., depletion of variants compared with the neutral expectation), and have limited power to detect selection on individual elements, such as a putative enhancer. We will develop a unified statistical model that leverages several major signals to detect selection at both base and element levels.
Our model builds on the powerful Poisson Random Field (PRF) model, taking complex human demographic history into account. We also leverage mutation rate estimates from Aim 1 and use a number of genomic annotations to set the prior distribution of selection effects through a hierarchical Bayesian model. (3) Studying the role of human constrained sequences in disease genetics. We hypothesize that sequences under selective constraint in humans, both coding and noncoding, are highly enriched for disease-causing variants. We will test this hypothesis using data from genome-wide association studies (GWAS), with a special focus on neuropsychiatric phenotypes. We will develop procedures that leverage both functional genomic data and selective constraints to prioritize disease variants.
The goal of this proposal is to develop a set of computational tools and software to infer which regions of the human genome evolve under constraint. These constrained sequences are functionally important and are known to be enriched for disease-causing variants. Our proposal will lead to better analysis of large-scale DNA sequencing data to discover important disease-related DNA sequence elements.