The most basic difference between cells that share a genotype but differ in phenotype lies in their transcriptomes. Understanding how two transcriptomes differ, in terms of the RNA molecules present in each or changes in the abundance of specific molecules, can offer valuable insight into the molecular mechanisms of disease, development, and specialization. High-throughput sequencing provides a unique view of the transcriptome in the form of millions or even billions of short reads of nucleotide sequence sampled from the RNA molecules. To date, nearly 1,000 such RNA-seq datasets have been deposited in the NCBI Gene Expression Omnibus. Beyond measuring differences in overall gene expression between samples, there is a critical need to measure differences in expression at the transcript level. Computational tools that can extract significant changes in transcript diversity across populations from RNA-seq data are in immediate demand. However, reconstructing the full set of transcript isoforms from this wealth of data is not a solved problem, because isoforms are fundamentally ambiguous at the scale of short reads. We propose a novel approach to the differential analysis of transcriptomes that does not depend on reconstructing full-length transcripts, yet can accurately pinpoint variation between transcriptomes. Our techniques are data-driven and applicable to any transcriptome; they require only a reference genome and do not depend on a priori gene structure annotations. Our research program builds on our highly sensitive and accurate MapSplice alignment algorithm to construct expression-weighted splice graphs (ESGs) from RNA-seq datasets. ESGs can be three orders of magnitude smaller than current RNA-seq datasets, yet fully represent the substantive biological content of such datasets.
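As a rough illustration of the expression-weighted splice graph idea, the sketch below collapses spliced reads at one locus into weighted junction edges and compares how two samples use those edges. It is a toy under stated assumptions, not the project's data structure, algorithm, or the MapSplice output format: the `SpliceGraph` class, the `usage_shift` helper, the (donor, acceptor) read tuples, and all coordinates are hypothetical, and no statistical test is performed.

```python
from collections import defaultdict


class SpliceGraph:
    """Toy expression-weighted splice graph for a single locus (hypothetical).

    Each edge is a splice junction (donor position, acceptor position);
    its weight is the number of spliced reads that support it.
    """

    def __init__(self):
        self.junction_weight = defaultdict(int)

    def add_junction_read(self, donor, acceptor, count=1):
        # Many reads over the same junction collapse into one weighted edge.
        self.junction_weight[(donor, acceptor)] += count

    def donor_usage(self):
        """Fraction of each donor's reads that go to each acceptor."""
        totals = defaultdict(int)
        for (donor, _), weight in self.junction_weight.items():
            totals[donor] += weight
        return {
            (donor, acceptor): weight / totals[donor]
            for (donor, acceptor), weight in self.junction_weight.items()
        }


def build_graph(junction_reads):
    """Build a graph from an iterable of (donor, acceptor) spliced reads."""
    graph = SpliceGraph()
    for donor, acceptor in junction_reads:
        graph.add_junction_read(donor, acceptor)
    return graph


def usage_shift(graph_a, graph_b):
    """Per-junction change in donor usage from sample A to sample B."""
    usage_a, usage_b = graph_a.donor_usage(), graph_b.donor_usage()
    edges = set(usage_a) | set(usage_b)
    return {e: usage_b.get(e, 0.0) - usage_a.get(e, 0.0) for e in edges}


# Made-up reads: donor 100 splices either to a cassette exon at 200
# (inclusion) or directly to the downstream exon at 300 (skipping).
sample_a = [(100, 200)] * 8 + [(100, 300)] * 2
sample_b = [(100, 200)] * 2 + [(100, 300)] * 8

graph_a = build_graph(sample_a)
graph_b = build_graph(sample_b)
shift = usage_shift(graph_a, graph_b)

for edge in sorted(shift):
    print(edge, round(shift[edge], 3))
```

In this toy case ten reads per sample reduce to two weighted edges, which is the sense in which a graph summary can be far smaller than the reads it represents. A positive shift on the (100, 300) edge indicates more exon skipping in sample B; deciding whether such a shift is significant would require replicates and a proper statistical model, which this sketch omits.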
The ESG representation supports highly efficient analysis techniques that can directly identify and visualize statistically significant differential transcription between samples. We propose generalizations of the algorithms to identify co-regulated splicing patterns, which are key to biological pathway and systems biology analyses. We have established an ongoing, interactive, and collaborative research environment among the co-PIs and co-Is, who include biologists, computer scientists, and a statistician. The proposed computational methods will be tested and refined using RNA-seq data generated from breast cancer cell lines before being applied to three well-curated RNA-seq datasets on lung cancer pathogenesis, stem cells in leukemia, and articular cartilage development and repair in the horse (a non-model mammalian organism). Experimental validation of differentially expressed transcript isoforms will both improve the accuracy of our methods and yield novel candidate alternative isoforms associated with lung cancer, leukemia, and chondrocyte differentiation. The software will be open-source and will be developed as a set of components that can be used on their own or integrated into RNA-seq processing workflows. In particular, we will integrate the components into the Galaxy cloud computing framework hosted on a local server. As such, the methods will be available to researchers worldwide. As components mature, they may be installed on other servers worldwide to provide a convenient and secure way to analyze transcriptomes. Unveiling the dynamics of the transcriptome at modest cost will revolutionize cellular diagnostics and biomedical research. Genome-wide measurement of transcription variants offers the potential for detailed molecular information about cellular identity and function that will greatly expand on traditional histological assessment.
Cloud-based access to the methods can turn individual laboratories into small genome centers and will enable individual scientists to assess differences among RNA transcriptomes in a matter of days. Our suite of algorithms will enable biomedical researchers to prioritize candidate genes or gene ontology categories for further investigation of differential transcription and its mechanistic importance across experimental conditions.

Public Health Relevance

This project will develop new techniques to compare the transcriptomes of different cells, sampled using high-throughput sequencing technology. Because the transcriptome consists of the RNA instructions that control a cell's functions, differential analysis can provide detailed insight into the differences between healthy and diseased cells, or can reveal more clearly the functional differences between, say, liver cells and brain cells.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Feingold, Elise A
University of North Carolina Chapel Hill
Biostatistics & Other Math Sci
Schools of Arts and Sciences
Chapel Hill
United States
Welch, Joshua D; Hartemink, Alexander J; Prins, Jan F (2016) SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol 17:106
Welch, Joshua D; Hu, Yin; Prins, Jan F (2016) Robust detection of alternative splicing in a population of single cells. Nucleic Acids Res 44:e73
Welch, Joshua D; Baran-Gale, Jeanette; Perou, Charles M et al. (2015) Pseudogenes transcribed in breast invasive carcinoma show subtype-specific expression and ceRNA potential. BMC Genomics 16:113
Huang, Yan; Hu, Yin; Liu, Jinze (2014) Piecing the puzzle together: a revisit to transcript reconstruction problem in RNA-seq. BMC Bioinformatics 15 Suppl 9:S3
Slevin, Michael K; Meaux, Stacie; Welch, Joshua D et al. (2014) Deep sequencing shows multiple oligouridylations are required for 3' to 5' degradation of histone mRNAs on polyribosomes. Mol Cell 53:1020-30
Simon, Jeremy M; Hacker, Kathryn E; Singh, Darshan et al. (2014) Variation in chromatin accessibility in human kidney cancer links H3K36 methyltransferase loss with widespread RNA processing defects. Genome Res 24:241-50
Huang, Yan; Hu, Yin; Jones, Corbin D et al. (2013) A robust method for transcript quantification with RNA-seq data. J Comput Biol 20:167-87
Jeck, William R; Sorrentino, Jessica A; Wang, Kai et al. (2013) Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA 19:141-57
Cabanski, Christopher R; Wilkerson, Matthew D; Soloway, Matthew et al. (2013) BlackOPs: increasing confidence in variant detection through mappability filtering. Nucleic Acids Res 41:e178
Hu, Yin; Huang, Yan; Du, Ying et al. (2013) DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res 41:e39

Showing the most recent 10 out of 16 publications