The most basic difference between cells of the same genotype but different phenotype lies in their transcriptomes. Understanding how two transcriptomes differ, whether in the RNA molecules present in each or in the abundance of specific molecules, can offer valuable insight into the molecular mechanisms of disease, development, and specialization. High-throughput sequencing provides a unique view of the transcriptome in the form of millions or even billions of short nucleotide sequence reads sampled from the RNA molecules. To date, nearly 1000 such RNA-seq datasets have been deposited in the NCBI Gene Expression Omnibus. Beyond measuring differences in overall expression of genes between samples, there is a critical need to measure differences in expression at the transcript level. Computational tools that can extract significant changes in transcript diversity across populations from RNA-seq data are in immediate demand. However, reconstructing the full extent of transcript isoforms from this wealth of data is not a solved problem, because of fundamental ambiguities between isoforms at the scale of the short reads. We propose a novel approach to the differential analysis of transcriptomes that does not depend on the reconstruction of full-length transcripts, yet can accurately pinpoint variation between transcriptomes. Our techniques are data-driven and applicable to any transcriptome; they require only a reference genome and do not depend on a priori gene structure annotations. Our research program builds on our highly sensitive and accurate MapSplice alignment algorithm to construct expression-weighted splice graphs (ESGs) from RNA-seq datasets. ESGs can be three orders of magnitude smaller than current RNA-seq datasets, yet fully represent the substantive biological content of such datasets.
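As a rough illustration of the idea behind an expression-weighted splice graph, the Python sketch below collapses spliced read alignments into junction edges weighted by the number of supporting reads. It is a simplified stand-in, not the project's ESG construction and not MapSplice's output format: the `Block` and `Alignment` types, the `build_junction_graph` function, and the toy coordinates are all hypothetical, and a fuller representation would presumably also carry exon coverage.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical types for illustration only.
Block = Tuple[int, int]    # half-open genomic interval [start, end) of one aligned exon block
Alignment = List[Block]    # ordered exon blocks of a single (possibly spliced) read
Junction = Tuple[int, int]  # (donor_end, acceptor_start)


def build_junction_graph(alignments: Iterable[Alignment]) -> Counter:
    """Count the reads supporting each splice junction.

    Each pair of consecutive blocks in a read implies one junction from the
    end of the first block to the start of the next. Unspliced reads
    (a single block) contribute no edges.
    """
    edges: Counter = Counter()
    for blocks in alignments:
        for (_, donor_end), (acceptor_start, _) in zip(blocks, blocks[1:]):
            edges[(donor_end, acceptor_start)] += 1
    return edges


# Toy alignments on a single chromosome (made-up coordinates).
reads: List[Alignment] = [
    [(100, 150), (300, 350)],
    [(120, 150), (300, 370)],
    [(130, 150), (500, 550)],
    [(310, 350), (500, 540)],
    [(200, 250)],  # unspliced read
]

graph = build_junction_graph(reads)
for (donor, acceptor), weight in sorted(graph.items()):
    print(f"{donor} -> {acceptor}: {weight} read(s)")
```

In this toy case five reads collapse into three weighted edges, which hints at how a graph summary can be far smaller than the read set it was built from.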
The ESG representation supports highly efficient analysis techniques that can directly identify and visualize statistically significant differential transcription between samples. Generalizations of the algorithms are proposed to identify co-regulated splicing patterns that are key to biological pathway analyses and systems biology analyses. We have established an ongoing interactive and collaborative research environment among the co-PIs and co-Is, who include biologists, computer scientists, and a statistician. The proposed computational methods will be tested and refined using RNA-seq data generated from breast cancer cell lines before being applied to three well-curated RNA-seq datasets on lung cancer pathogenesis, stem cells in leukemia, and equine articular cartilage development and repair (a non-model mammalian organism). Experimental validation of differentially expressed transcript isoforms will both improve the accuracy of our methods and propose novel candidates for alternative isoforms associated with lung cancer, leukemia, and chondrocyte differentiation. The software will be open-source and will be developed as a set of components that can be used on their own or integrated into RNA-seq processing workflows. In particular, we will integrate the components into the Galaxy cloud computing framework hosted on a local server. As such, the methods will be available to researchers worldwide. As components mature, they may be installed on other servers worldwide to provide a convenient and secure way to analyze transcriptomes. Unveiling the dynamics of the transcriptome at modest cost will revolutionize cellular diagnostics and biomedical research. Genome-wide measurement of transcription variants offers the potential for detailed molecular information about cellular identity and function that will greatly expand traditional histological assessment.
Cloud-based access to the methods can turn individual laboratories into small genome centers and will enable individual scientists to assess differences among RNA transcriptomes in a matter of days. Our suite of algorithms will enable biomedical researchers to prioritize candidate genes or different gene ontology categories to investigate further for differential transcription and mechanistic importance between experimental conditions.
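One way to picture differential analysis that works on splice junctions rather than reconstructed transcripts is to compare, at a shared donor site, how reads are distributed across alternative acceptors in two samples. The Python sketch below uses a plain Pearson chi-square test for that comparison. This is a standard test swapped in purely for illustration and is not the statistical method proposed in the project; the function names `donor_table` and `chi_square`, the junction counts, and the absence of any multiple-testing correction are all assumptions of the example.

```python
import math
from typing import Dict, List, Tuple

Junction = Tuple[int, int]  # hypothetical key: (donor_end, acceptor_start)


def donor_table(sample_a: Dict[Junction, int],
                sample_b: Dict[Junction, int],
                donor: int) -> List[List[int]]:
    """Build a 2 x k table of read counts for every acceptor used from `donor`."""
    acceptors = sorted({acc for (d, acc) in (*sample_a, *sample_b) if d == donor})
    return [[s.get((donor, acc), 0) for acc in acceptors]
            for s in (sample_a, sample_b)]


def chi_square(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Made-up junction weights for two samples.
sample_a = {(150, 300): 90, (150, 500): 10, (350, 500): 40}
sample_b = {(150, 300): 50, (150, 500): 50, (350, 500): 45}

table = donor_table(sample_a, sample_b, donor=150)
stat, df = chi_square(table)

# For one degree of freedom the chi-square tail probability has a closed form;
# other cases would need a statistics library.
p_value = math.erfc(math.sqrt(stat / 2)) if df == 1 else float("nan")
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.3g}")
```

With these toy counts the statistic is about 38.1 on one degree of freedom, flagging the donor at position 150 as a candidate for differential splicing. A real analysis would also need replicate-aware modeling and correction for testing many sites.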

Public Health Relevance

This project will develop new techniques to compare the transcriptomes of different cells, sampled using high-throughput sequencing technology. As the transcriptome consists of the RNA instructions controlling the cell's functions, differential analysis can provide detailed insight into the differences between healthy and diseased cells, or can reveal more clearly the functional differences between, say, liver cells and brain cells.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Feingold, Elise A
University of North Carolina Chapel Hill
Biostatistics & Other Math Sci
Schools of Arts and Sciences
Chapel Hill
United States
Zhang, Yi; Liu, Xinan; MacLeod, James et al. (2018) Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach. BMC Genomics 19:971
Liu, Xinan; Yu, Ye; Liu, Jinpeng et al. (2018) A novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures. Bioinformatics 34:171-178
Welch, Joshua D; Hartemink, Alexander J; Prins, Jan F (2017) MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol 18:138
Su, Wei; Slevin, Michael K; Marzluff, William F et al. (2016) Synthetic mRNA with Superior Properties that Mimics the Intracellular Fates of Natural Histone mRNA. Methods Mol Biol 1428:93-114
Welch, Joshua D; Williams, Lindsay A; DiSalvo, Matthew et al. (2016) Selective single cell isolation for genomics using microraft arrays. Nucleic Acids Res 44:8292-301
Welch, Joshua D; Hartemink, Alexander J; Prins, Jan F (2016) SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol 17:106
Welch, Joshua D; Hu, Yin; Prins, Jan F (2016) Robust detection of alternative splicing in a population of single cells. Nucleic Acids Res 44:e73
Cancer Genome Atlas Network (2015) Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517:576-82
Hestand, Matthew S; Kalbfleisch, Theodore S; Coleman, Stephen J et al. (2015) Annotation of the Protein Coding Regions of the Equine Genome. PLoS One 10:e0124375
Hestand, Matthew S; Zeng, Zheng; Coleman, Stephen J et al. (2015) Tissue Restricted Splice Junctions Originate Not Only from Tissue-Specific Gene Loci, but Gene Loci with a Broad Pattern of Expression. PLoS One 10:e0144302

Showing the most recent 10 out of 27 publications