It is now possible to readily identify sequence variants that have been subject to natural selection in human populations. Adaptive variants have been shown to underlie diversity in both disease risk and morphology across human populations, suggesting that the lens of evolution remains a powerful, yet currently underutilized, tool for understanding human biology. One challenge to realizing this potential is that over 99% of human genetic variation is non-coding, as are most signals of selection and most variants associated with human phenotypic traits. Mutations modifying transcription factor (TF) binding sites in non-coding, cis-regulatory regions are hypothesized to be more amenable to driving phenotypic evolution than coding changes. These cis-regulatory regions are major determinants of tissue-specific gene expression, and variation within enhancers has been implicated in both selection and disease. However, enhancer 'regulatory grammar', the complex pattern of sequences that interact with TFs to control gene expression, is poorly understood. The challenge of linking a regulatory variant under selection to an adaptive phenotype is similar to that facing disease association studies: the targeted gene, cell type, and biological process are often unknown, hindering further investigation. Machine learning algorithms automate the discovery of patterns, making them well suited to uncovering sequence constraints and the combinatorial TF binding patterns of enhancers without relying on limited motif databases. I will optimize support vector machine classifiers to elucidate regulatory grammar from over 100 cell types and tissues. These models will be used to predict the impact of sequence variants on cis-regulatory sites, expanding the utility of the NHGRI's ENCODE and Roadmap projects.
I will then apply these tissue-specific predictions to signals of selection and/or disease from the NHGRI's 1000 Genomes Project and its genome-wide association studies catalog. By describing the regulatory impact of signals of selection, I will globally describe patterns of functional adaptation across populations, identifying genes and gene networks targeted by selection and/or disease. I will functionally characterize variants using massively parallel reporter assays and genome editing to give deeper insight into specific cases of evolution relevant to human health. This project seeks to develop tools and resources to describe the structure and function of sequence variation in the human genome. It also seeks to vastly increase the number of elucidated cases of human evolution, and specifically to characterize those adaptive variants associated with disease. I will share the functional regulatory predictions for use in interpreting a broad range of genomic datasets. This work has broad implications, from interpreting genetic variants in populations and understanding functional targets of evolution, to prioritizing non-coding mutations for precision genomic medicine and beyond.
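
As an illustration only, and not the implementation used in this project, the idea of predicting a variant's regulatory impact from a sequence classifier can be sketched as follows. The sketch assumes a linear model over k-mers, so that each k-mer carries one weight, and scores a single-nucleotide variant as the change in summed weight across the k-mers that overlap it. The function names, the choice of k = 3, and the weights are all hypothetical toy values, not trained on any ENCODE or Roadmap data.

```python
from typing import Dict, Iterator


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sequence_score(seq: str, weights: Dict[str, float], k: int) -> float:
    """Linear score: sum of per-k-mer weights; k-mers without a weight count as 0."""
    return sum(weights.get(kmer, 0.0) for kmer in kmers(seq, k))


def variant_effect(ref_seq: str, pos: int, alt: str,
                   weights: Dict[str, float], k: int) -> float:
    """Return score(alt) - score(ref) over the k-mers overlapping position pos.

    pos is a 0-based index into ref_seq; alt is the single substituted base.
    """
    if not 0 <= pos < len(ref_seq):
        raise ValueError("pos is outside the sequence")
    if len(alt) != 1:
        raise ValueError("only single-nucleotide substitutions are handled")
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    # Only k-mers that include pos can differ between the two alleles.
    start = max(0, pos - k + 1)
    end = min(len(ref_seq), pos + k)
    return (sequence_score(alt_seq[start:end], weights, k)
            - sequence_score(ref_seq[start:end], weights, k))


# Hypothetical toy weights, standing in for a trained tissue-specific model.
toy_weights = {"GAT": 1.5, "ATA": 1.0, "TAA": 0.5, "GCT": -0.5}

# Substituting A -> C at index 2 of AGATAAG disrupts the GAT and ATA k-mers.
effect = variant_effect("AGATAAG", 2, "C", toy_weights, k=3)
print(effect)  # negative value: the alternate allele lowers the predicted score
```

Under these assumptions, a negative value suggests the alternate allele weakens the predicted regulatory activity in the modeled tissue and a positive value suggests it strengthens it. Comparing such scores across models for different cell types is one way tissue-specific effects could be contrasted.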

Public Health Relevance

While thousands of genomic regions have been linked to human evolution and diseases, many of the genetic variants responsible are non-coding and thus difficult to interpret. I propose to describe the tissue-specific effects of regulatory variants using machine learning algorithms, and then integrate these functional predictions with signals of selection and/or disease association in order to pinpoint genetic variants important for human evolution and health.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Postdoctoral Individual National Research Service Award (F32)
Project #
5F32HG009226-03
Application #
9614325
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Gatlin, Tina L
Project Start
2017-01-01
Project End
2019-08-31
Budget Start
2019-01-01
Budget End
2019-08-31
Support Year
3
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Harvard University
Department
Biology
Type
Schools of Arts and Sciences
DUNS #
082359691
City
Cambridge
State
MA
Country
United States
Zip Code
02138
Tewhey, Ryan; Kotliar, Dylan; Park, Daniel S et al. (2016) Direct Identification of Hundreds of Expression-Modulating Variants using a Multiplexed Reporter Assay. Cell 165:1519-1529