Recent evidence shows that non-coding RNAs are ubiquitous in the cell and that their functions and structures vary far more than previously imagined. Multiple new RNA classes have been implicated in many diseases, and understanding how these RNAs work is a critical need. Although discoveries are accumulating, functional knowledge of these new RNAs remains limited. Here we propose to couple a new high-throughput RNA duplex sequencing technology with new computational methods to study novel functional non-coding RNAs economically at a genomic scale. We will develop two computational methodologies to characterize putative, newly discovered non-coding RNAs. First, we will develop a maximum likelihood approach that estimates RNA secondary structure from RNA-seq assays that preferentially sequence single- or double-stranded nucleotides. Second, we will develop a machine-learning framework that predicts the functional category of novel non-coding RNAs using length and structure features of known RNAs. These structural and functional predictions will be validated by comparative genomics and experimentation. We will develop databases and analysis software and apply them to human and five model organisms. Together, these findings will yield insight into non-coding RNA biology and will substantially inform continued study of these important molecules.
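
To make the first aim concrete, the following is a minimal sketch of a per-nucleotide maximum likelihood estimate, not the method the project will actually develop. It assumes a deliberately simple binomial model that the abstract does not specify: at each position, every covering read comes from the double-stranded library with probability p (the pairing probability) and from the single-stranded library otherwise, with both libraries sequenced to comparable depth. The function names and the read counts in the usage note are hypothetical.

```python
import math


def _count_times_log(count, prob):
    """Return count * log(prob), treating 0 * log(0) as 0."""
    if count == 0:
        return 0.0
    if prob <= 0.0:
        return float("-inf")
    return count * math.log(prob)


def log_likelihood(p, ds_count, ss_count):
    """Binomial log-likelihood of pairing probability p (constant term dropped).

    ds_count: reads covering the position in the double-stranded library.
    ss_count: reads covering the position in the single-stranded library.
    """
    return _count_times_log(ds_count, p) + _count_times_log(ss_count, 1.0 - p)


def mle_paired_probability(ds_count, ss_count):
    """Closed-form maximizer of log_likelihood: ds / (ds + ss).

    Returns None when the position has no coverage, because the
    likelihood is then flat and no estimate is identifiable.
    """
    total = ds_count + ss_count
    if total == 0:
        return None
    return ds_count / total


def estimate_profile(ds_counts, ss_counts):
    """Apply the per-position estimate along a transcript."""
    return [mle_paired_probability(d, s) for d, s in zip(ds_counts, ss_counts)]
```

For example, `estimate_profile([8, 0, 3, 0], [2, 5, 3, 0])` gives one estimate per position and `None` for the uncovered last position. A realistic model would also need to account for library depth, sequence bias, and dependence between neighboring bases, none of which this sketch attempts.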

Public Health Relevance

We propose to develop computational methods to study novel non-coding RNA transcripts by leveraging a new duplex RNA sequencing technique. Our first objective is to develop a maximum likelihood algorithm that estimates secondary structure from double-stranded and single-stranded RNA sequencing data. We will also develop a machine-learning framework that predicts the functional category of novel non-coding RNAs using length and structure features from RNA-seq experiments. These methods will be used to annotate all RNA transcripts using experimental data from human and five model organisms.
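
As an illustration of classifying transcripts by length and structure features, here is a minimal sketch that uses a nearest-centroid classifier as a stand-in; the statement above does not name the learning algorithm, so this choice is an assumption. The two features (log10 transcript length and mean per-nucleotide pairing probability), the function names, and the class labels in the usage note are hypothetical.

```python
import math
from collections import defaultdict


def features(length, paired_probs):
    """Build a (log10 length, mean pairing probability) feature pair.

    Positions with no structure estimate (None) are ignored.
    """
    known = [p for p in paired_probs if p is not None]
    mean_paired = sum(known) / len(known) if known else 0.0
    return (math.log10(length), mean_paired)


def fit_centroids(examples):
    """Average the feature pairs of each labeled class.

    examples: iterable of (label, (feature_1, feature_2)) for known RNAs.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for label, (f1, f2) in examples:
        acc = sums[label]
        acc[0] += f1
        acc[1] += f2
        acc[2] += 1
    return {label: (a / n, b / n) for label, (a, b, n) in sums.items()}


def predict(centroids, feat):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], feat))
```

With training pairs such as `("class_A", (2.0, 0.8))` and `("class_B", (3.0, 0.2))`, `predict(fit_centroids(training), (2.1, 0.7))` returns the label of the nearer centroid. In practice the features would need scaling, and a richer model and feature set would likely be required.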

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Brazhnik, Paul
University of Pennsylvania
Schools of Medicine
United States
Berkowitz, Nathan D; Silverman, Ian M; Childress, Daniel M et al. (2016) A comprehensive database of high-throughput sequencing-based RNA secondary structure probing data (Structure Surfer). BMC Bioinformatics 17:215
Leung, Yuk Yee; Kuksa, Pavel P; Amlie-Wolf, Alexandre et al. (2016) DASHR: database of small human noncoding RNAs. Nucleic Acids Res 44:D216-22
Hwang, Yih-Chii; Lin, Chiao-Feng; Valladares, Otto et al. (2015) HIPPIE: a high-throughput identification pipeline for promoter interacting enhancer elements. Bioinformatics 31:1290-2
Amlie-Wolf, Alexandre; Ryvkin, Paul; Tong, Rui et al. (2015) Transcriptomic Changes Due to Cytoplasmic TDP-43 Expression Reveal Dysregulation of Histone Transcripts and Nuclear Chromatin. PLoS One 10:e0141836
Mirarab, Siavash; Nguyen, Nam; Guo, Sheng et al. (2015) PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. J Comput Biol 22:377-86
Vandivier, Lee E; Campos, Rafael; Kuksa, Pavel P et al. (2015) Chemical Modifications Mark Alternatively Spliced and Uncapped Messenger RNAs in Arabidopsis. Plant Cell 27:3024-37
Ryvkin, Paul; Leung, Yuk Yee; Ungar, Lyle H et al. (2014) Using machine learning and high-throughput RNA sequencing to classify the precursors of small non-coding RNAs. Methods 67:28-35
Leung, Yuk Yee; Ryvkin, Paul; Ungar, Lyle H et al. (2013) CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res 41:e137
Ryvkin, Paul; Leung, Yuk Yee; Silverman, Ian M et al. (2013) HAMR: high-throughput annotation of modified ribonucleotides. RNA 19:1684-92
Hwang, Yih-Chii; Zheng, Qi; Gregory, Brian D et al. (2013) High-throughput identification of long-range regulatory elements and their target promoters in the human genome. Nucleic Acids Res 41:4835-46