Comprehensive identification of all functional elements encoded in genomes is a fundamental need in both basic and applied biological research. Although the coding regions of genomes are well understood, the noncoding regions, representing over 98% of mammalian genomes, are far less studied, but hold the key to understanding gene regulation, evolution, genetic basis of complex phenotypes, etc. The goal of this project is to develop computational methods to infer the function of noncoding sequences by leveraging the plethora of data from publicly available genomic data and state-of-the-art algorithms from machine learning. These algorithms can greatly expand the utility of existing genomic data, improving the accuracy of annotating pathogenicity of noncoding variants, and offering a new way of studying grammars of gene regulation encoded by noncoding sequence. The project will additionally create opportunities to facilitate interactions between biologists and computer scientists, and offer interdisciplinary training for both undergraduate and graduate students, especially those from traditionally underrepresented groups.
The goal of this project is to develop a new computational framework based on deep learning to understand noncoding sequences. Over the past few years, researchers have generated thousands of genome-scale datasets on chromatin accessibility, histone modifications, DNA methylation, protein-binding, and others, spanning a broad range of tissue and cell types. This project will integrate these heterogeneous datasets to derive a comprehensive characterization of noncoding sequence through innovative machine learning algorithms based on convolutional and recurrent neural nets, and deep generative models. The PI will develop deep learning algorithms to map the relationship between noncoding sequences and the diverse genomic measurements, learn chromatin states and discover novel functional elements from these measurements, and predict effects of noncoding genetic variants. Training a flexible and scalable learning model with large amounts of data provides a way of characterizing noncoding sequences in an unbiased and robust fashion, and offers a better chance of extracting complex regulatory rules encoded within noncoding sequences than conventional methods. This project will provide the genomics community with a versatile, modular, open-source toolbox of software packages, with the goal of greatly improving the accuracy of current genome analyses.