Increasing amounts of sequence data are being generated across biomedical research disciplines, particularly through the application of next-generation sequencing technologies to the genomic analysis of humans and their associated microbiome. However, the bioinformatic infrastructure needed for sequence processing, which requires demanding software installations and access to powerful CPUs, is currently a major bottleneck for the further expansion of research in this field. Our proposal aims to address this problem by building a portable, stand-alone Virtual Machine (VM) software package that combines all essential tools for basic metagenomic, viral and eukaryotic sequence analysis. This VM will allow researchers to run entire software pipelines locally or on server networks, independently of the host operating system and without further installation steps. Furthermore, because the VM package is portable, researchers will be able to outsource compute-intensive processing steps to external Cloud Computing networks, which exist as both free academic services and commercial services. As part of the proposed project, we will assemble bioinformatics analysis pipelines for metagenomic, viral and eukaryotic genome sequencing projects relevant to research on the human microbiome and the human host (Aim 1). Supported applications will include 16S rRNA-based phylogeny and community comparison, as well as the identification, assembly, annotation and functional characterization of viral, bacterial and eukaryotic sequences from metagenomic Human Microbiome Project (HMP) samples. In addition, pipelines will be provided for other increasingly relevant applications in this field: removal of human DNA from HMP samples; mapping of eukaryotic and prokaryotic DNA and RNA sequences, including SNP and splice site identification; and de novo sequence assembly and annotation of eukaryotic microbes from the human microbiome, such as fungi and protists.
Integrated analysis pipelines will be packaged into a VM, creating a stand-alone, push-button sequence analysis package (Aim 2), which will be optimized for performance on commercial and academic Clouds (Aim 3). Objective measures of the suitability of Clouds for next-generation sequence analysis will be determined, including runtime performance metrics such as execution time and relative speedup, as well as requirements for disk storage, memory, and data transfer throughput to and from Clouds. Outreach to the broad user community, including key collaborators utilizing Cloud platforms, will promote interoperability and the development of standards for performing sequence analysis on Cloud platforms (Aim 4). Extensive documentation will be provided, including interactive online training seminars (webinars). User satisfaction, including software ease of use, quality of documentation, and value of instructional seminars, will be surveyed. An advisory board will be established to review, in conjunction with the Program Officer, all user statistics, performance metrics, technical developments and the overall success of the project.
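The runtime metrics named above can be made concrete with a short sketch. It assumes the conventional definition of relative speedup, namely the wall-clock execution time of a baseline run divided by the execution time of the same pipeline on a larger number of Cloud instances, and adds parallel efficiency (speedup per instance) as a related figure. The function names and the timings in the usage lines are hypothetical illustrations, not measurements or definitions taken from the project.

```python
def relative_speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """Ratio of baseline execution time to the execution time of the scaled run.

    Assumption: both timings are wall-clock seconds for the same pipeline
    and the same input data.
    """
    if baseline_seconds <= 0 or scaled_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / scaled_seconds


def parallel_efficiency(speedup: float, n_instances: int) -> float:
    """Speedup divided by the number of instances used (1.0 means ideal scaling)."""
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    return speedup / n_instances


# Hypothetical timings: 120 s on one instance, 30 s on eight instances.
speedup = relative_speedup(120.0, 30.0)
efficiency = parallel_efficiency(speedup, 8)
print(f"speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```

An efficiency well below 1.0, as in this made-up example, would suggest that adding instances yields diminishing returns, which is the kind of trade-off such metrics are meant to expose.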
von Rosenvinge, Erik C; Song, Yang; White, James R et al. (2013) Immune status, antibiotic medication and pH are associated with changes in the stomach fluid microbiota. ISME J 7:1354-66
Angiuoli, Samuel V; White, James R; Matalka, Malcolm et al. (2011) Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS One 6:e26624
Angiuoli, Samuel V; Matalka, Malcolm; Gussman, Aaron et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12:356