Sequencing of DNA and cDNA libraries on "next-generation" sequencing (NGS) platforms has become the method of choice for genomic and transcriptional analyses. One obstacle that inhibits wider adoption of NGS techniques is the lack of comprehensive yet easy-to-use software packages with which to conduct data analysis. To meet this need, we have developed RseqFlow, a set of common analytic modules for the analysis of RNA-seq data, formalized into an easy-to-use workflow. The workflow is managed by the Pegasus Workflow Management System (WMS), which maps the modules to available computational resources and automatically executes the steps in the appropriate order. A Virtual Machine (VM) was created for the software package, which eliminates complex configuration and installation steps. In this proposal, we plan to extend RseqFlow to include more analytic functions and to generalize it to work for multiple model organisms, including mouse, worm, fruit fly, plant, and yeast. We also propose the development of a similar workflow for the analysis of genome re-sequencing data. Both workflows will take advantage of several analytic tools we have developed, including PerM (short read alignment), ComB (SNP calling), Clippers (indel/junction detection), and WeaV (de novo assembly). One of the unique features of our workflow is an iterative alignment strategy in which sequence variants are used to update the sequence and improve alignment accuracy, which in turn allows us to accurately determine not only SNPs and indels but also structural and copy-number variations. A final effort will combine the workflows for RNA-seq data and genome re-sequencing data to perform RNA editing analysis. All programs developed under this proposal will be rigorously tested on a number of different data sets and on multiple computational platforms, and will follow sound software engineering practices.
All software released under this proposal will be open source and will benefit the many biological projects that incorporate DNA and RNA sequencing approaches.

Public Health Relevance

High-throughput sequencing (HTS) has been used to study human genetics and diseases and human microbial communities, and is also a growing analytical tool for clinical trials. The goal of this research is to develop computational software workflows to aid in the analysis of DNA and RNA sequencing data sets. Our open-source software tools will benefit researchers worldwide who use HTS to perform various biological and medical studies.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01HG006531-02
Application #
8416340
Study Section
Special Emphasis Panel (ZHG1-HGR-M (O3))
Program Officer
Sofia, Heidi J
Project Start
2012-02-01
Project End
2014-12-31
Budget Start
2013-01-01
Budget End
2013-12-31
Support Year
2
Fiscal Year
2013
Total Cost
$316,985
Indirect Cost
$107,104
Name
University of Southern California
Department
Biology
Type
Schools of Arts and Sciences
DUNS #
072933393
City
Los Angeles
State
CA
Country
United States
Zip Code
90089
Liu, Zehua; Lou, Huazhe; Xie, Kaikun et al. (2017) Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nat Commun 8:22
Chen, Emily A; Souaiaia, Tade; Herstein, Jennifer S et al. (2014) Effect of RNA integrity on uniquely mapped reads in RNA-Seq. BMC Res Notes 7:753
Zeng, Feng; Jiang, Rui; Chen, Ting (2013) PyroHMMsnp: an SNP caller for Ion Torrent and 454 sequencing data. Nucleic Acids Res 41:e136
Lehmann, Kjong-Van; Chen, Ting (2013) Exploring functional variant discovery in non-coding regions with SInBaD. Nucleic Acids Res 41:e7
Zeng, Feng; Jiang, Rui; Chen, Ting (2013) PyroHMMvar: a sensitive and accurate method to call short indels and SNPs for Ion Torrent and 454 data. Bioinformatics 29:2859-68