Recent evidence shows that non-coding RNAs are ubiquitous in the cell and that their functions and structures vary far more than previously imagined. Multiple new RNA classes have been implicated in many diseases, so understanding how these RNAs work is a critical need. Although exciting discoveries are accumulating, functional knowledge of these new RNAs remains limited. Here we propose to couple a new high-throughput RNA duplex sequencing technology with new computational methods to study novel functional non-coding RNAs economically at a genomic scale. We will develop two computational methodologies to characterize putative, newly found non-coding RNAs. First, we will develop a maximum likelihood approach that estimates RNA secondary structure from RNA-seq assays that preferentially sequence single- or double-stranded nucleotides. Second, we will develop a machine-learning framework that predicts the functional category of novel non-coding RNAs using length and structure features of known RNAs. These structural and functional predictions will be validated by comparative genomics and experimentation. We will develop databases and analysis software, and will investigate the human genome and those of five model organisms. Together, our findings are expected to yield substantial insight into non-coding RNA biology and to inform continued study of these important molecules.

Public Health Relevance

We propose to develop computational methods to study novel non-coding RNA transcripts by leveraging a new duplex RNA sequencing technique. Our first objective is to develop a maximum likelihood algorithm that estimates secondary structure from double-stranded and single-stranded RNA sequencing data. We will also develop a machine-learning framework that predicts the functional category of novel non-coding RNAs using length and structure features from RNA-seq experiments. These methods will be used to annotate all RNA transcripts using experimental data from human and five model organisms.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM099962-02
Application #: 8545184
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Brazhnik, Paul
Project Start: 2012-09-14
Project End: 2017-05-31
Budget Start: 2013-06-01
Budget End: 2014-05-31
Support Year: 2
Fiscal Year: 2013
Total Cost: $301,080
Indirect Cost: $112,905

Institution: University of Pennsylvania
Department: Pathology
Institution Type: Schools of Medicine
DUNS #: 042250712
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104

Ryvkin, Paul; Leung, Yuk Yee; Ungar, Lyle H et al. (2014) Using machine learning and high-throughput RNA sequencing to classify the precursors of small non-coding RNAs. Methods 67:28-35
Leung, Yuk Yee; Ryvkin, Paul; Ungar, Lyle H et al. (2013) CoRAL: predicting non-coding RNAs from small RNA-sequencing data. Nucleic Acids Res 41:e137
Hwang, Yih-Chii; Zheng, Qi; Gregory, Brian D et al. (2013) High-throughput identification of long-range regulatory elements and their target promoters in the human genome. Nucleic Acids Res 41:4835-46
Ryvkin, Paul; Leung, Yuk Yee; Silverman, Ian M et al. (2013) HAMR: high-throughput annotation of modified ribonucleotides. RNA 19:1684-92