The proper function and health of an organism rest on the correct expression of its genes: in the first step of expression, RNA molecules are produced from the genes in a number of possible forms. Accurately determining how much RNA is produced and the structure of that RNA are the goals of this research. Many experiments use high-throughput sequencing of RNA to show how much gene expression is taking place, what parts of the genomic DNA are making the RNA, and how DNA regions combine to make functional RNA. Many steps are required to process RNA and obtain sequence data, and each of them adds noise to the data. Errors also occur when the RNA sequence is compared to a genome sequence that has gaps in it or that was not correctly assembled. The effect of the noise and errors is that estimates of how much of each type of RNA is present are not very accurate, which can give misleading results. The aim of this research is to develop methods that overcome these technical problems so that good quantitation and a better understanding of biological processes are possible. The new algorithms will be incorporated into software packages available for use by interested members of the scientific community, so that the benefits of the improvements will be widely shared. In addition, better analysis of RNA sequencing experiments is expected to have a positive impact on many scientific disciplines, from basic cell biology to the development of clinical tests.

High-throughput sequencing of RNA has proven itself an invaluable tool for gene discovery and for the annotation of new isoforms of both coding and non-coding genes. However, it still falls short of its ultimate promise of providing quantitative and comparative measures of transcript abundance. This gap is due to a series of technical factors. Among them are biases introduced by employing an inexact reference genome as the standard for associating sequence data with transcripts, noise due to misalignments caused by paralogous sequences such as pseudogenes, biases introduced by unannotated transcripts, sense/antisense transcript interference, and origin bias due to aligning diploid data to a haploid model. The objective of the project is to develop methods that either overcome or side-step all of these factors in an effort to deliver on the promise of RNA sequencing for quantitative analysis. Our research plan includes developing computational models and efficient algorithms for simultaneously rebalancing reads between genes and pseudogenes and among genes within gene families, robust alignment-free methods for estimating transcript abundances and allele-specific expression patterns, and a de novo approach for isoform and novel transcript discovery using DNAseq and RNAseq from a single sample. The proposed computational tools will be integrated into software packages under a common application framework adopted by the broad scientific community. The results of the project can be found at www.cs.ucla.edu/~weiwang/NSF1565137.html
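
To illustrate what rebalancing reads between a gene and its pseudogene can mean in practice, the following is a minimal sketch of a generic expectation-maximization (EM) reassignment of ambiguously aligned reads. It is not the project's algorithm: the function name `em_rebalance`, the input layout, and the locus labels are hypothetical, and the sketch assumes every read aligns equally well to each of its candidate loci and ignores transcript length.

```python
def em_rebalance(read_loci, n_iter=200, tol=1e-9):
    """Estimate the fraction of reads originating from each locus.

    read_loci: list of tuples; each tuple names the loci (hypothetical labels,
    e.g. a gene and its pseudogene) that one read aligns to equally well.
    Simplifying assumption: no length or alignment-quality weighting.
    """
    reads = [tuple(r) for r in read_loci if r]  # drop reads with no alignment
    if not reads:
        return {}
    loci = sorted({locus for r in reads for locus in r})
    # Start from a uniform guess over all loci.
    theta = {locus: 1.0 / len(loci) for locus in loci}
    for _ in range(n_iter):
        counts = dict.fromkeys(loci, 0.0)
        # E-step: split each read among its candidate loci in proportion
        # to the current abundance estimates.
        for r in reads:
            z = sum(theta[locus] for locus in r)
            for locus in r:
                counts[locus] += theta[locus] / z
        # M-step: the new estimate is the expected share of reads per locus.
        new_theta = {locus: counts[locus] / len(reads) for locus in loci}
        delta = max(abs(new_theta[locus] - theta[locus]) for locus in loci)
        theta = new_theta
        if delta < tol:
            break
    return theta


# Toy input: 8 reads unique to the gene, 2 unique to the pseudogene,
# and 10 reads that align to both.
toy_reads = [("GENE",)] * 8 + [("PSEUDO",)] * 2 + [("GENE", "PSEUDO")] * 10
estimates = em_rebalance(toy_reads)
```

In this toy input the uniquely aligned reads favor the gene four to one, so the ambiguous reads are shared out in roughly that ratio instead of being split evenly or discarded.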

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1565137
Program Officer: Peter McCartney
Project Start:
Project End:
Budget Start: 2016-07-01
Budget End: 2021-06-30
Support Year:
Fiscal Year: 2015
Total Cost: $600,000
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095