Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline

Grama, Ananth; Meyer, Folker

Abstract

Metagenomics, the study of microbial populations sampled directly from the environment, affords avenues for discovering novel enzymes via microbial profiling, using microbial shifts as predictors of health, or gauging the sustainability of human operations such as mineral mining. However, the volume of metagenomic data is large (e.g., the metagenome of a human's gut microbiota is about 1 gigabase pair (Gbp) in size), and the processing needed to extract meaning from these datasets is significant: identifying which organisms' genomes are in the sample (taxonomic annotation) and what they are doing (functional annotation) requires comparisons against continually updated knowledge databases. These demands are only growing as experimentalists request more and more metagenomic analysis runs. Born out of this need, our MG-RAST (Metagenomics-Rapid Annotation) portal, an open-source, high-throughput metagenomics service, has been a major community resource since 2008, housing over 160K datasets and 40K users. However, since its original design, MG-RAST has witnessed the frenetic development of next-generation sequencing technologies, a drastically altered computing landscape (in both hardware and software), changed requirements in terms of the number of users and the volume and diversity of datasets, increasing complexity of pipeline components, and demands for higher throughput. To adapt, MG-RAST has been continually modified. Modifications have included upgrading the pipeline components with several algorithmic improvements; deploying a customized data and workflow management system (the SHOCK object store and the AWE workflow manager); and porting MG-RAST to a cloud-based distributed architecture. Notwithstanding these continual, albeit ad hoc, system improvements, our pilot studies have indicated the need for a comprehensive redesign of MG-RAST to keep pace with the rapidly advancing field of metagenomics.
Our proposed enhancements are based on expressed user requirements, new usage patterns, and the flexibility to incorporate new tools, especially for the compute-intensive similarity analysis of queried sequences. Through this project, we propose to accomplish MG-RAST's transformation by (i) improving its functionality and data reproducibility; (ii) improving its software quality and performance through automated monitoring and generation of test suites; and (iii) moving toward a federated infrastructure for metagenomics data. Overall, the successful accomplishment of our aims will support alternate metagenomics service models through federation of services and data, and will result in a robust, state-of-the-art metagenomics resource. In general, federation in biomedical pipelines is a powerful way to leverage the expertise of diverse user bases and, reciprocally, to benefit those users. Thus MG-RAST, as a state-of-the-art pipeline, will be capable of supporting an ever-increasing user base, handling larger and more varied datasets, and evolving in concert with new genomics technologies. The ultimate goal is to accelerate advances in end-user applications, e.g., personalized medicine tailored to the patient's microbiome.

Public Health Relevance

Analysis of metagenomic data, i.e., genetic material recovered from environmental samples, has tremendous potential for advances in diverse clinical and ecological applications. We have built and maintained since 2008 an open portal for metagenomic data processing called MG-RAST. Here we present a plan for continued development and maintenance of MG-RAST to support a larger and more diverse user-base and a greater diversity of datasets and processing tools in the pipeline.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of Allergy and Infectious Diseases (NIAID)
Type: Research Project (R01)
Project #: 5R01AI123037-05
Application #: 9906157
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Brown, Liliana L

Project Start: 2016-03-01
Project End: 2021-02-28
Budget Start: 2020-03-01
Budget End: 2021-02-28
Support Year: 5
Fiscal Year: 2020
Total Cost
Indirect Cost

Institution

Name: Purdue University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 072051394

City: West Lafayette
State: IN
Country: United States
Zip Code: 47907

Related projects


NIH 2020 R01 AI	Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline Grama, Ananth; Meyer, Folker / Purdue University
NIH 2019 R01 AI	Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline Grama, Ananth; Meyer, Folker / Purdue University
NIH 2018 R01 AI	Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline Grama, Ananth; Meyer, Folker / Purdue University
NIH 2017 R01 AI	Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline Grama, Ananth; Meyer, Folker / Purdue University	$743,525
NIH 2016 R01 AI	Continued Development and Maintenance of the MG-RAST Metagenomics Pipeline Grama, Ananth; Meyer, Folker / Purdue University

Publications

Ghoshal, Asish; Zhang, Jinyi; Roth, Michael A et al. (2018) A Distributed Classifier for MicroRNA Target Prediction with Validation Through TCGA Expression Data. IEEE/ACM Trans Comput Biol Bioinform 15:1037-1051

Ten Hoopen, Petra; Finn, Robert D; Bongo, Lars Ailo et al. (2017) The metagenomic data life-cycle: standards and best practices. Gigascience 6:1-11

Bowers, Robert M; Kyrpides, Nikos C; Stepanauskas, Ramunas et al. (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35:725-731

Thompson, Luke R; Sanders, Jon G; McDonald, Daniel et al. (2017) A communal catalogue reveals Earth's multiscale microbial diversity. Nature 551:457-463

Meyer, Folker; Bagchi, Saurabh; Chaterji, Somali et al. (2017) MG-RAST version 4-lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Brief Bioinform :

Chaterji, Somali; Ahn, Eun Hyun; Kim, Deok-Ho (2017) CRISPR Genome Engineering for Human Pluripotent Stem Cell Research. Theranostics 7:4445-4469

Wilke, Andreas; Bischof, Jared; Gerlach, Wolfgang et al. (2016) The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 44:D590-4

Magner, Abram; Duda, Jarosław; Szpankowski, Wojciech et al. (2016) Fundamental Bounds for Sequence Reconstruction from Nanopore Sequencers. IEEE Trans Mol Biol Multiscale Commun 2:92-106

Kim, Seong Gon; Harwani, Mrudul; Grama, Ananth et al. (2016) EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm. Sci Rep 6:38433
