Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences

Seelig, Georg; Shendure, Jay

Abstract

The proportion of the human genome that underlies gene regulation dwarfs the proportion that encodes proteins. However, we remain poorly equipped for identifying which genetic variants compromise gene regulatory function in ways that may contribute to risk for both rare and common human diseases. Understanding how non-coding sequences regulate gene expression, as well as being able to predict the functional consequences of genetic variation for gene regulation, are paramount challenges for the field. Here, we propose to combine synthetic biology, massively parallel functional assays, and machine learning to profoundly advance our understanding of the "regulatory code" of the human genome. While challenging, the task of unravelling complex codes from large amounts of empirical data is not without precedent. For example, over the past decade, computer scientists working in natural language processing have made immense progress, driven in large part by a combination of algorithmic and computational improvements and enormously larger training datasets than were available to the previous generations of scientists working in this area. Inspired by the revolutionizing impact of "big data" for traditional problems in machine learning, we propose to model gene regulatory phenomena using training datasets with several orders of magnitude more examples than naturally exist in the human genome. We predict that the models learned from massive numbers of synthetic examples will strongly outperform models learned from the small number of natural examples. We will demonstrate our approach by developing comprehensive, quantitative, and predictive models for alternative splicing and alternative polyadenylation, two widespread regulatory mechanisms by which a single gene can code for multiple transcripts and proteins. However, we anticipate that this basic paradigm (specifically, the massively parallel measurement of the functional behavior of extremely large numbers of synthetic sequences followed by quantitative modeling of sequence-function relationships) can be generalized to advance our understanding of diverse forms of gene regulation.

Public Health Relevance

This research seeks to develop predictive models of alternative splicing and polyadenylation by learning from millions of synthetic constructs, orders of magnitude more than the number of endogenous examples. These models will be applied to understand the consequences of genetic variation in humans and how this variation can lead to disease.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG009136-01A1
Application #: 9306648
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Pazin, Michael J

Project Start: 2017-04-21
Project End: 2021-01-31
Budget Start: 2017-04-21
Budget End: 2018-01-31
Support Year: 1
Fiscal Year: 2017
Total Cost: $536,954
Indirect Cost: $185,371

Institution

Name: University of Washington
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 605799469

City: Seattle
State: WA
Country: United States
Zip Code: 98195

Related projects


NIH 2020 R01 HG	Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences Seelig, Georg; Shendure, Jay Ashok / University of Washington
NIH 2019 R01 HG	Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences Seelig, Georg; Shendure, Jay Ashok / University of Washington
NIH 2018 R01 HG	Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences Seelig, Georg; Shendure, Jay Ashok / University of Washington
NIH 2017 R01 HG	Predictive Modeling of Alternative Splicing and Polyadenylation from Millions of Random Sequences Seelig, Georg; Shendure, Jay Ashok / University of Washington	$536,954

Publications

Starita, Lea M; Ahituv, Nadav; Dunham, Maitreya J et al. (2017) Variant Interpretation: Functional Assays to the Rescue. Am J Hum Genet 101:315-325
