This project aims to develop an efficient and accurate computational method for identifying novel transcripts and estimating their expression levels. Transcriptome assembly and gene expression profiling are key components in a vast range of biological experiments today, playing a central role in unraveling the complexity of cell types, cell differentiation, responses to stress, and myriad other conditions. Although transcript assemblers have been developed previously, most of them perform poorly on real, large-scale RNA sequencing data sets, which severely limits their impact. To produce better transcript models, a new method will be developed that combines ideas from several scientific disciplines. By ensuring that this method works on the very large data sets that are routinely produced by modern next-generation sequencing instruments, this project will have an impact on a wide range of studies across the spectrum of eukaryotic species. It will also enhance the research infrastructure by providing free, open-source software that can be reused by other scientists for commercial, educational, or basic research endeavors.
This new method uses an optimization technique known as maximum flow in a specially constructed flow network to determine gene expression levels, and it does this while simultaneously assembling each splice variant of a gene. It also incorporates techniques from whole-genome assembly, which have the potential to dramatically improve detection of alternative splice variants. Using pre-assembled reads will greatly reduce the computational load and memory requirements associated with transcriptome assembly, as many of the short reads will be combined into longer contigs that span multiple exons. Furthermore, the new method will address a critical need for a transcriptome assembly method that can handle the numerous gaps present in draft genomes and produce better-assembled transcripts by stitching together portions of transcripts situated on multiple fragments of the genome. The results of this project will be disseminated at http://ccb.jhu.edu.
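As a rough illustration of the maximum-flow idea described above, the sketch below runs a textbook Edmonds-Karp algorithm on a toy splice graph. It is not the project's actual network construction or software: the graph, the node names (`source`, `exon1`, `exon2`, `exon3`, `sink`), the capacities (standing in for read coverage), and the function name `max_flow` are all assumptions invented for this example.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Textbook Edmonds-Karp maximum flow (illustrative sketch only).

    capacity: dict mapping node -> {neighbor: capacity}.
    Assumes the graph has no antiparallel edge pairs (u->v and v->u).
    Returns (total_flow, flow_per_edge).
    """
    if source == sink:
        raise ValueError("source and sink must differ")

    # residual[u][v] is the remaining capacity on u -> v.
    # Reverse edges are created with zero capacity.
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
            residual[v][u] += 0

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no augmenting path left, so the flow is maximal

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

    # Flow on each original edge = capacity minus what remains.
    flow = {
        (u, v): cap - residual[u][v]
        for u, edges in capacity.items()
        for v, cap in edges.items()
    }
    return total, flow


# Hypothetical splice graph: exon2 is either included or skipped.
# Capacities are made-up coverage values, not real data.
TOY_SPLICE_GRAPH = {
    "source": {"exon1": 10},
    "exon1": {"exon2": 6, "exon3": 4},  # inclusion vs. skipping junction
    "exon2": {"exon3": 6},
    "exon3": {"sink": 10},
}

total, flow = max_flow(TOY_SPLICE_GRAPH, "source", "sink")
```

In this toy setting, the flow routed through `exon1 -> exon2 -> exon3` versus the flow on the skipping edge `exon1 -> exon3` can be read as relative support for two hypothetical splice variants. A production method would need a far more careful network construction than this sketch provides.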