Over the past decade, sequencing technologies have been developed that enable the profiling of gene expression across a wide variety of organisms and tissue types. They allow the investigation, on a transcriptome-wide scale, of how gene expression changes in different conditions, under various stimuli, and in different disease states. These technologies are transformative for both basic science (e.g., understanding cell biology) and applied science (e.g., approaches to drug development). However, the deluge of data they produce brings with it a host of computational challenges, such as determining whether samples contain previously unannotated genes, accurately determining the sequence of these genes, and quantifying the abundance of all the genes expressed in a sample. Much effort has been dedicated to developing reliable computational methods for processing this data. Yet even the best existing solutions are sometimes insufficiently accurate, and they are becoming computationally burdensome given the rapid rate at which new data is being produced.

The goal of this project is to develop a new generation of accurate and lightweight methods for analyzing gene and transcript expression using sequencing data. These tools will apply new data structures and algorithmic ideas to the problems of mapping sequencing reads, discovering and assembling new transcripts, and accurately and robustly quantifying gene expression. Further, these methods will work in the context of both established technologies and the newly emerging protocols that allow measuring cell-specific gene expression across thousands of individual cells. The methods and software produced as a result of this project will help enable new discoveries by being more sensitive and accurate than existing approaches, will reduce costs by decreasing computational demands, and will speed up analyses by producing results more quickly than existing approaches. The outreach goals of this project include the creation of educational media, including videos and a podcast series, that will help convey key insights and benefits of new computational genomics methods to both practicing biologists and the scientifically interested public at large.

Lightweight quantification methods streamline many common transcriptomic analyses, such as differential expression testing in well-annotated organisms and common tissue types. Yet substantial challenges remain that prevent the use of lightweight methods in many analysis tasks, e.g., when novel transcripts should be considered, or when events such as intron retention may play an important role. This work will advance the accuracy and fundamental capabilities of lightweight transcriptome analysis methods. Specifically, a new graph-based data structure will be developed for indexing a collection of reference sequences. A lightweight alignment tool will be built around this index; it will incorporate a statistical model that shares splicing contexts across large collections of samples to guide and inform difficult alignment problems. A multi-sample methodology for joint transcript discovery and quantification will also be developed, based on new approaches to modeling the joint likelihood of transcript sequences and their abundances. Efficient likelihood factorizations will allow this approach to remain computationally tractable. Finally, a suite of tools for processing and quantifying high-throughput, single-cell RNA-seq data will be developed. These tools will adopt a novel approach that solves the cell barcoding, UMI deduplication, and gene expression estimation problems jointly within a unified statistical framework. The underlying model will share statistical information between cells to improve clustering and quantification, and to analyze expression at the resolution supported by the data, i.e., as groups of distinguishable isoforms. All of these tools will be released as high-quality, open-source software.
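
To make the idea of a likelihood factorization concrete, the following is a minimal sketch of one common form used in lightweight quantification generally: reads that are compatible with the same set of transcripts are collapsed into an equivalence class, and an EM algorithm then iterates over classes rather than individual reads. This is an illustration under stated assumptions, not the method proposed in this project. The function names, the toy transcripts "A" and "B", and the read counts are all hypothetical, and the sketch ignores transcript lengths, fragment-level alignment probabilities, and bias models.

```python
from collections import Counter


def equivalence_classes(read_compat):
    """Collapse reads into classes keyed by their set of compatible transcripts.

    read_compat: iterable of iterables, one per read, listing transcript names.
    Returns a Counter mapping frozenset(transcripts) -> number of reads.
    """
    return Counter(frozenset(ts) for ts in read_compat)


def em_abundances(eq_classes, transcripts, n_iter=200):
    """Estimate relative abundances with EM over equivalence classes.

    Simplifying assumption (hypothetical model): each read is equally likely
    to arise from any compatible transcript, weighted only by abundance.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each class's read count in proportion to current theta.
        for ts, n in eq_classes.items():
            denom = sum(theta[t] for t in ts)
            if denom == 0.0:
                continue  # no compatible transcript has support; skip the class
            for t in ts:
                counts[t] += n * theta[t] / denom
        # M-step: renormalize expected counts into abundances.
        norm = sum(counts.values())
        theta = {t: c / norm for t, c in counts.items()}
    return theta


# Toy data: 6 reads map only to A, 2 only to B, 4 ambiguously to both.
reads = [("A",)] * 6 + [("B",)] * 2 + [("A", "B")] * 4
classes = equivalence_classes(reads)
theta = em_abundances(classes, ["A", "B"])
print(len(classes), theta)
```

In this toy example the twelve reads collapse into three classes, so each EM iteration touches three entries instead of twelve, and the estimates converge toward 0.75 for A and 0.25 for B. The cost per iteration depends on the number of distinct classes rather than the number of reads, which is the kind of saving a likelihood factorization is meant to provide.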

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 2029424
Program Officer: Mitra Basu
Budget Start: 2020-04-13
Budget End: 2023-01-31
Fiscal Year: 2020
Total Cost: $523,529
Name: University of Maryland College Park
City: College Park
State: MD
Country: United States
Zip Code: 20742