The leading and trailing untranslated regions (UTRs) of an mRNA, along with the coding sequence (CDS), control protein production by modulating translation and mRNA stability. Although a vast number of regulatory features have been identified in these regions, we are still far from being able to predict, for example, whether and how a sequence variant affects the level of protein being made. Here, we propose to address this urgent need by combining high-throughput experimental characterization of protein expression in synthetic libraries with machine learning to create predictive models of translation and mRNA stability. Recent progress in machine vision, voice recognition and other fields of computer science has been driven by the availability of enormous data sets on which to train models. Machine learning approaches have also had remarkable impact in biology, but biological data sets are often comparatively small, limiting the quality of the models that can be learned. For example, there are only around 20,000 genes in the human genome, a restrictively small set of examples for training a predictive model that captures the full extent of the genome's "regulatory code." In this proposal, we aim to overcome this data size limitation by training predictive models of protein expression on data from millions of synthetic constructs, a data set several orders of magnitude larger than the number of genes in the genome. Specifically, we will create libraries of in vitro transcribed mRNA with targeted variation in the UTRs and CDS and will assay protein expression of each library member by performing high-throughput polysome profiling, ribosome profiling, and mRNA stability assays. We will then use neural network approaches to learn predictive models of the relationship between mRNA sequence and levels of protein production.
We will apply our models to three applications of practical importance. First, we expect to uncover novel biology, for example by identifying regulatory sequence elements and the interactions between them. Second, we will validate our models through the de novo design and experimental testing of sequences that result in higher levels of protein production than any of the millions of randomly generated members of the original library or than the endogenous UTR sequences currently used in biotechnology. Such stable and highly translated mRNA constructs would be of particular value for the field of mRNA therapeutics. Third, we will predict the functional consequences of genetic variation in UTRs on protein production and we will validate these predictions experimentally. We are far from understanding which genetic variants compromise gene regulatory function in ways that may contribute to disease, making such a comprehensive and quantitative analysis of variants valuable.
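
As a rough illustration of the sequence-to-expression modeling and variant-effect prediction described above, the following minimal sketch one-hot encodes short RNA sequences, fits a model, and scores a single-base variant. It is not the proposed method: ridge regression is swapped in for the neural networks named in the abstract so that the example stays self-contained, and the library, the per-base effects, and the function names (`one_hot`, `fit_ridge`, `variant_effect`) are all hypothetical, with randomly generated values standing in for measured protein expression.

```python
import numpy as np

ALPHABET = "ACGU"

def one_hot(seq):
    """Encode an RNA sequence as a flat one-hot vector of length len(seq) * 4."""
    idx = [ALPHABET.index(base) for base in seq]
    mat = np.zeros((len(seq), len(ALPHABET)))
    mat[np.arange(len(seq)), idx] = 1.0
    return mat.ravel()

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def variant_effect(w, ref, alt):
    """Predicted change in expression when sequence ref is replaced by alt."""
    return float(one_hot(alt) @ w - one_hot(ref) @ w)

# Toy library of random 10-mers; these are simulated values, NOT real measurements.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(ALPHABET), size=10)) for _ in range(500)]
X = np.stack([one_hot(s) for s in seqs])

# Hypothetical per-base effects used only to simulate an expression readout.
true_w = rng.normal(size=X.shape[1])
y = X @ true_w + rng.normal(scale=0.1, size=len(seqs))

# Fit the model and score a single-base substitution at the first position.
w = fit_ridge(X, y)
ref = seqs[0]
alt = ("A" if ref[0] != "A" else "C") + ref[1:]
effect = variant_effect(w, ref, alt)
print(f"predicted change in expression for {ref} -> {alt}: {effect:.3f}")
```

A linear model of this kind treats each position independently, so it cannot capture the interactions between regulatory elements that the proposal aims to uncover; that limitation is one reason a more expressive model class would be needed in practice.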

Public Health Relevance

This research project aims to use machine learning to build predictive models of translation and mRNA stability. The models will be trained on data obtained by measuring protein production for millions of synthetic RNA constructs. They will then be applied to understand the consequences of genetic variation in humans and to design mRNA therapeutics.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Pazin, Michael J
University of Washington
Engineering (All Types)
Biomed Engr/Col Engr/Engr Sta
United States