Bioinformatic analysis of large genomic datasets is a critical barrier for many biologists, especially those at smaller research institutions. Leveraging our team's bioinformatics experience, we aim to develop an interactive web application that easily translates RNA sequencing data into biological insights. We hypothesize that an integrated tool for reproducible, in-depth analysis of expression data will democratize access to high-throughput technologies and help biologists pinpoint molecular pathways from large datasets. The application will be a carefully designed, user-friendly pipeline with rich data visualization capacity. As a proof of concept, the team developed a prototype called iDEP (integrated Differential Expression and Pathway analysis) for the analysis of summarized expression matrices. Its unique features include (1) comprehensive analytic functionality based on 63 R and Bioconductor packages, covering exploratory data analysis, clustering, differential gene expression and pathway analysis; (2) a massive knowledgebase for automatic gene ID conversion, annotation, and pathway analysis for over 2,000 archaeal, bacterial and eukaryotic species; (3) reproducibility of some core steps by generating R and R Markdown notebooks; (4) application programming interfaces (APIs) for retrieval of protein-protein interaction networks and KEGG pathway diagrams; and (5) easy access to about 13,000 processed public RNA-seq datasets in 9 species. Compared with existing tools, the key innovation is the emphasis on deep integration (tools, annotation, pathways, and public datasets), user-friendliness, and reproducibility. Even with limited features, iDEP is beginning to be adopted by researchers from diverse fields. In this proposal, the team plans to complete the development of iDEP.
The goal of Specific Aim 1 is to (a) rewrite iDEP in a modular, object-oriented fashion, (b) build an R package for generating fully reproducible R Markdown notebooks, and (c) add essential functionality such as bias correction (batch effect, GC content, gene length, expression level), time-course analysis, supervised classification, and additional methods for existing functional modules. We will also enable gene ontology enrichment analysis for unannotated species using Blast2GO.
Specific Aim 2 focuses on (a) substantially expanding the pathway database for frequently studied species and (b) collecting more uniformly processed RNA-seq and DNA microarray datasets to facilitate the re-analysis and meta-analysis of public expression data.
In Specific Aim 3, the team will carry out hardware upgrades, rigorous testing, code review, documentation, and community integration. The development of iDEP can help make standard RNA-seq analysis accessible to a very broad community of researchers.

Public Health Relevance

RNA sequencing is a widely used technique for surveying the activities of tens of thousands of genes in healthy or diseased tissues. We will develop a web application to help researchers with visualization, statistical analysis and biological interpretation of the resulting large datasets.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG010805-01A1
Application #
9978200
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Gilchrist, Daniel A
Project Start
2020-09-02
Project End
2024-06-30
Budget Start
2020-09-02
Budget End
2021-06-30
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
South Dakota State University
Department
Biostatistics & Other Math Sci
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
929929743
City
Brookings
State
SD
Country
United States
Zip Code
57007