Most disease-associated variants lie in non-coding regions of the genome and exert their influence through effects on gene expression. However, we lack a predictive framework to interpret such non-coding variants, which limits how genomic data are used in precision medicine. New machine learning algorithms may make it possible to interpret non-coding variants, but so far the practical applications of machine learning in functional genomics have been limited by two major challenges. First, the size and diversity of training data sets in functional genomics are orders of magnitude smaller than in applications where machine learning has been successful, such as image recognition and product recommendation. Second, if training data are not collected in an appropriate in vitro cellular model, the resulting machine learning models may not generalize to relevant in vivo cell types. To improve the application of machine learning to non-coding variants, I propose to address both the limited size of training data sets and the fidelity of cell culture models.
A core principle of machine learning is that model performance improves with more data. In Aim 1, I propose to increase the size and diversity of training data by performing iterative cycles of machine learning and experimental validation with Massively Parallel Reporter Assays (MPRAs). The key aspect of my approach is to algorithmically design each successive MPRA library to contain the sequences that are most likely to improve the next round of modeling. I recently trained my first model on data that I collected from MPRA experiments on cis-regulatory sequences that function in mammalian photoreceptors. To avoid the limitations of cell lines, I performed these experiments in ex vivo developing retinas, which retain the appropriate tissue architecture. However, unlike photoreceptors, most cell types are not experimentally tractable in their native physiological context. Thus, it will be important to determine how well in vitro cell lines recapitulate in vivo cis-regulation.
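The iterative design loop proposed in Aim 1 can be illustrated with a minimal sketch. The proposal does not specify how sequences are chosen for each new library, so the selection rule here (disagreement among a bootstrap ensemble of ridge regressions) is an assumption made only for illustration, as are the one-hot encoding and the linear model. The function `simulated_mpra` is a hypothetical stand-in for the wet-lab measurement, and all names and parameter values are invented for the example.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Flatten a DNA sequence into a one-hot vector (4 values per base)."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx].ravel()

def fit_ensemble(X, y, n_models=10, ridge=1.0, rng=None):
    """Fit ridge regressions on bootstrap resamples of the measured data.

    Returns a weight matrix of shape (n_models, n_features).
    """
    rng = np.random.default_rng(rng)
    weights = []
    for _ in range(n_models):
        rows = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        Xb, yb = X[rows], y[rows]
        gram = Xb.T @ Xb + ridge * np.eye(X.shape[1])
        weights.append(np.linalg.solve(gram, Xb.T @ yb))
    return np.array(weights)

def select_next_library(weights, X_candidates, library_size):
    """Return indices of the candidates the ensemble disagrees on most."""
    preds = X_candidates @ weights.T      # shape (n_candidates, n_models)
    disagreement = preds.std(axis=1)      # assumed proxy for informativeness
    return np.argsort(disagreement)[::-1][:library_size]

def simulated_mpra(X, true_w, rng, noise=0.1):
    """Hypothetical stand-in for an MPRA experiment: noisy linear activity."""
    return X @ true_w + rng.normal(0.0, noise, size=len(X))

def run_cycles(n_cycles=3, pool_size=500, library_size=50, seq_len=12, seed=0):
    """Alternate between model fitting and library design for n_cycles rounds."""
    rng = np.random.default_rng(seed)
    pool = ["".join(rng.choice(list(BASES), size=seq_len))
            for _ in range(pool_size)]
    X_pool = np.array([one_hot(s) for s in pool])
    true_w = rng.normal(size=X_pool.shape[1])  # hidden "ground truth"

    # Round 0: a random starting library.
    measured = rng.choice(pool_size, size=library_size, replace=False).tolist()
    y = simulated_mpra(X_pool[measured], true_w, rng).tolist()

    for _ in range(n_cycles):
        weights = fit_ensemble(X_pool[measured], np.array(y), rng=rng)
        remaining = np.setdiff1d(np.arange(pool_size), measured)
        best = select_next_library(weights, X_pool[remaining], library_size)
        picks = remaining[best]
        measured.extend(picks.tolist())
        y.extend(simulated_mpra(X_pool[picks], true_w, rng).tolist())
    return measured, np.array(y)
```

Calling `run_cycles()` returns the indices of every sequence measured across the starting library and three designed libraries, together with their simulated activities. In a real application the linear ensemble would be replaced by whatever model is being trained, and the simulated assay by actual MPRA measurements.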
In Aim 2, I propose to determine whether a tractable cell culture model can recapitulate results from ex vivo retinas. I will use existing MPRA data from ex vivo retinas as a standard to compare against data collected in cell lines engineered to express combinations of photoreceptor transcription factors. This comparison will test whether engineering tractable cell lines to express tissue-specific transcription factors is a general strategy for collecting training data that yield machine learning models which generalize to in vivo systems.
Successful completion of these aims will produce a general approach for increasing the size and diversity of functional genomic training data and may yield a general method for building experimentally tractable systems for machine learning applications, ultimately helping us better apply genomic data to precision medicine.
Many genetic variants that cause disease do not affect the structure of genes, but instead affect short DNA sequences that control when, where, and how strongly a gene is switched on. Every person carries thousands of genetic variants in these control regions, but only a small number of these variants change gene activity. I propose a combined experimental and computational framework to predict which variants affect gene activity and which are harmless.