The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a semantically enabled database and distributed computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing, and sharing microbial biology data. With the rapid advance of newer DNA sequencing methods, the so-called Next Generation Sequencing (NGS) technologies such as Illumina HiSeq and MiSeq, researchers who use sequencing data find it increasingly difficult to meet the computing requirements of large-scale NGS datasets with existing methods. In response to these aspects of the big data challenge, the CAMERA team is developing new bioinformatics algorithms, high-performance computing solutions, visualization interfaces, and data resources that specifically address NGS data analysis. Here, the group proposes a crosscutting methodology for analyzing NGS data that marries innovative bioinformatics algorithms and workflows with leading-edge computational methods for managing large-scale distributed computing. The integration of XSEDE resources for big data analysis will provide the scale and specification necessary to drive the development of this system. The project will be conducted over two years. Year one will focus on refining the core CAMERA cyberinfrastructure (CI), e.g., Panfish, and on continuing the development of core NGS workflows and algorithms. Specifically, CAMERA CI will be extended to take full advantage of two new NSF XSEDE resources to be commissioned in early 2015 (Wrangler at TACC and Comet at SDSC). Year two will focus on the production integration of Wrangler and Comet and the subsequent deployment of the NGS workflows (via CAMERA CI) to the entire CAMERA community.
These new software tools and pipelined processes facilitate the processing and analysis of very large-scale metagenomic data, on the order of tens of GB per sample, and provide comprehensive and unique functions such as 16S analysis [7], taxonomy binning [8], assembly, rRNA finding, clustering, filtering, function and pathway annotation, and visualization. Once made to operate in extensible HPC cloud environments, these next-generation tools enable computation that is orders of magnitude faster, more comprehensive analysis, integrated data output, and novel ways to investigate complex data. The broader impact of the project stems from the fact that manual operations are currently necessary to complete an analysis with these tools, because of the complexity of the process and the large number of software tools involved. The goal of this project is to develop a series of fully integrated, easy-to-use analysis workflows that encapsulate these tools. These workflows will significantly improve NGS data analysis for researchers who use metagenomics as an investigative tool and who are now impeded by the challenges of managing and analyzing big data.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1419196
Program Officer: Robert Chadduck
Project Start:
Project End:
Budget Start: 2014-02-15
Budget End: 2016-01-31
Support Year:
Fiscal Year: 2014
Total Cost: $250,000
Indirect Cost:

Name: University of California San Diego
Department:
Type:
DUNS #:
City: La Jolla
State: CA
Country: United States
Zip Code: 92093