The most basic difference between cells of the same genotype but different phenotype lies in their transcriptomes. Understanding the difference between two transcriptomes in terms of the RNA molecules present in each, or changes in abundance of specific molecules, can offer valuable insight into the molecular mechanisms of disease, development, and specialization. High-throughput sequencing provides a unique view of the transcriptome in the form of millions or even billions of short reads of nucleotide sequences sampled from the RNA molecules. To date, nearly 1000 such RNA-seq datasets have been deposited in the NCBI Gene Expression Omnibus. Beyond measuring differences in overall expression of genes between samples, there is a critical need to measure differences in expression at the transcript level. Computational tools that can extract significant changes in transcript diversity across populations from RNA-seq data are in immediate demand. However, reconstructing the full extent of transcript isoforms from this wealth of data is not a solved problem because of fundamental ambiguities between isoforms at the scale of the short reads. We propose a novel approach to the differential analysis of transcriptomes that does not depend on the reconstruction of full-length transcripts, yet can accurately pinpoint variation between transcriptomes. Our techniques are data-driven and applicable to any transcriptome, requiring only a reference genome, and do not depend on a priori gene structure annotations. Our research program builds on our highly sensitive and accurate MapSplice alignment algorithm to construct expression-weighted splice graphs (ESGs) from RNA-seq datasets. ESGs can be three orders of magnitude smaller than current RNA-seq datasets, yet fully represent the substantive biological content of such datasets.
The ESG representation supports highly efficient analysis techniques that can directly identify and visualize statistically significant differential transcription between samples. Generalizations of the algorithms are proposed to identify co-regulated splicing patterns that are key to biological pathway and systems biology analyses. We have established an ongoing interactive and collaborative research environment among the co-PIs and co-Is, who include biologists, computer scientists, and a statistician. The proposed computational methods will be tested and refined using RNA-seq data generated from breast cancer cell lines before being applied to three well-curated RNA-seq datasets on lung cancer pathogenesis, stem cells in leukemia, and equine articular cartilage development and repair (a non-model mammalian organism). Experimental validation of differentially expressed transcript isoforms will both improve the accuracy of our methods and suggest novel candidate isoforms associated with lung cancer, leukemia, and chondrocyte differentiation. The software will be open-source and will be developed as a set of components that can be used on their own or integrated into RNA-seq processing workflows. In particular, we will integrate the components into the Galaxy cloud computing framework hosted on a local server, making the methods available to researchers worldwide. As components mature, they may be installed on other servers worldwide to provide a convenient and secure way to analyze transcriptomes. Unveiling the dynamics of the transcriptome at modest cost will revolutionize cellular diagnostics and biomedical research. Genome-wide measurement of transcription variants offers the potential for detailed molecular information about cellular identity and function that will greatly expand traditional histological assessment.
Cloud-based access to the methods can turn individual laboratories into small genome centers and will enable individual scientists to assess differences among RNA transcriptomes in a matter of days. Our suite of algorithms will enable biomedical researchers to prioritize candidate genes or gene ontology categories for further investigation of differential transcription and its mechanistic importance across experimental conditions.
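
As an illustration of the expression-weighted splice graph idea described in the abstract, the following is a minimal sketch, not the MapSplice or project implementation. It assumes each aligned read on a single chromosome and strand is given as a list of half-open exonic blocks (start, end); the names `SpliceGraph` and `build_esg` are hypothetical and chosen only for this example. Segments bounded by observed splice sites become nodes weighted by mean per-base coverage, and splice junctions become edges weighted by the number of supporting reads.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Block = Tuple[int, int]  # half-open genomic interval [start, end)


@dataclass
class SpliceGraph:
    """Hypothetical container: weighted segments (nodes) and junctions (edges)."""
    segments: Dict[Block, float] = field(default_factory=dict)  # mean coverage
    junctions: Dict[Block, int] = field(default_factory=dict)   # read support


def build_esg(reads: List[List[Block]]) -> SpliceGraph:
    """Summarize spliced alignments as an expression-weighted splice graph."""
    coverage: Counter = Counter()         # per-base read depth
    junction_counts: Counter = Counter()  # (donor, acceptor) -> supporting reads

    for blocks in reads:
        for start, end in blocks:
            for pos in range(start, end):
                coverage[pos] += 1
        # Each gap between consecutive blocks of one read is a splice junction.
        for (_, donor), (acceptor, _) in zip(blocks, blocks[1:]):
            junction_counts[(donor, acceptor)] += 1

    if not coverage:
        return SpliceGraph()

    # Cut the covered region at every observed splice site.
    cuts = {min(coverage), max(coverage) + 1}
    for donor, acceptor in junction_counts:
        cuts.update((donor, acceptor))
    ordered = sorted(cuts)

    segments: Dict[Block, float] = {}
    for start, end in zip(ordered, ordered[1:]):
        total = sum(coverage[pos] for pos in range(start, end))
        if total:  # keep only segments with read support (drops pure introns)
            segments[(start, end)] = total / (end - start)

    return SpliceGraph(segments, dict(junction_counts))


# Toy input: three exons, with one read skipping the middle exon.
toy_reads = [
    [(0, 10), (20, 30)],
    [(0, 10), (20, 30)],
    [(20, 30), (40, 50)],
    [(0, 10), (40, 50)],
]
graph = build_esg(toy_reads)
print(graph.segments)
print(graph.junctions)
```

In this sketch the graph stores one weight per segment and per junction regardless of how many reads support them, which is the sense in which such a summary can be far smaller than the read data it is derived from. Per-base counting is used only for clarity; a practical tool would work on interval endpoints.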

Public Health Relevance

This project will develop new techniques to compare the transcriptomes of different cells, sampled using high-throughput sequencing technology. As the transcriptome consists of the RNA instructions controlling the cell's functions, differential analysis can provide detailed insight into the difference between healthy and diseased cells, or can be used to reveal more clearly the functional differences between, say, liver cells and brain cells.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006272-02
Application #: 8473250
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Feingold, Elise A
Project Start: 2012-05-23
Project End: 2015-03-31
Budget Start: 2013-04-01
Budget End: 2014-03-31
Support Year: 2
Fiscal Year: 2013
Total Cost: $395,553
Indirect Cost: $72,723

Name: University of North Carolina Chapel Hill
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 608195277
City: Chapel Hill
State: NC
Country: United States
Zip Code: 27599
Huang, Yan; Hu, Yin; Liu, Jinze (2014) Piecing the puzzle together: a revisit to transcript reconstruction problem in RNA-seq. BMC Bioinformatics 15 Suppl 9:S3
Slevin, Michael K; Meaux, Stacie; Welch, Joshua D et al. (2014) Deep sequencing shows multiple oligouridylations are required for 3' to 5' degradation of histone mRNAs on polyribosomes. Mol Cell 53:1020-30
Simon, Jeremy M; Hacker, Kathryn E; Singh, Darshan et al. (2014) Variation in chromatin accessibility in human kidney cancer links H3K36 methyltransferase loss with widespread RNA processing defects. Genome Res 24:241-50
Wang, P; Dong, Q; Zhang, C et al. (2013) Mutations in isocitrate dehydrogenase 1 and 2 occur frequently in intrahepatic cholangiocarcinomas and share hypermethylation targets with glioblastomas. Oncogene 32:3091-100
Jeck, William R; Sorrentino, Jessica A; Wang, Kai et al. (2013) Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA 19:141-57
Hu, Yin; Huang, Yan; Du, Ying et al. (2013) DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res 41:e39