The proportion of the human genome that underlies gene regulation dwarfs the proportion that encodes proteins. However, we remain poorly equipped for identifying which genetic variants compromise gene regulatory function in ways that may contribute to risk for both rare and common human diseases. Understanding how non-coding sequences regulate gene expression, as well as being able to predict the functional consequences of genetic variation for gene regulation, are paramount challenges for the field. Here, we propose to combine synthetic biology, massively parallel functional assays, and machine learning to profoundly advance our understanding of the 'regulatory code' of the human genome. While challenging, the task of unravelling complex codes from large amounts of empirical data is not without precedent. For example, over the past decade, computer scientists working in natural language processing have made immense progress, driven in large part by a combination of algorithmic and computational improvements and enormously larger training datasets than were available to the previous generations of scientists working in this area. Inspired by the revolutionizing impact of "big data" for traditional problems in machine learning, we propose to model gene regulatory phenomena using training datasets with several orders of magnitude more examples than naturally exist in the human genome. We predict that the models learned from massive numbers of synthetic examples will strongly outperform models learned from the small number of natural examples. We will demonstrate our approach by developing comprehensive, quantitative, and predictive models for alternative splicing and alternative polyadenylation, two widespread regulatory mechanisms by which a single gene can code for multiple transcripts and proteins. However, we anticipate that this basic paradigm (specifically, the massively parallel measurement of the functional behavior of extremely large numbers of synthetic sequences followed by quantitative modeling of sequence-function relationships) can be generalized to advance our understanding of diverse forms of gene regulation.

Public Health Relevance

This research seeks to develop predictive models of alternative splicing and polyadenylation by learning from millions of synthetic constructs, orders of magnitude more than the number of endogenous examples. These models will be applied to understand the consequences of genetic variation in humans and how this variation can lead to disease.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG009136-02
Application #
9475243
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Chadwick, Lisa
Project Start
2017-04-21
Project End
2021-01-31
Budget Start
2018-02-01
Budget End
2019-01-31
Support Year
2
Fiscal Year
2018
Total Cost
Indirect Cost
Name
University of Washington
Department
Engineering (All Types)
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195
Starita, Lea M; Ahituv, Nadav; Dunham, Maitreya J et al. (2017) Variant Interpretation: Functional Assays to the Rescue. Am J Hum Genet 101:315-325