Metagenomics, the study of microbial populations sampled directly from the environment, affords avenues for discovering novel enzymes via microbial profiling, using microbial shifts as predictors of health, and gauging the sustainability of human operations such as mineral mining. However, the volume of metagenomic data is large (e.g., the metagenome of a human's gut microbiota is about 1 gigabase pair in size), and significant processing is needed to extract meaning from these datasets, such as identifying which organisms' genomes are in the sample (taxonomic annotation) and what they are doing (functional annotation) via comparisons with continually updated knowledge databases. Both the data volume and the processing load are only growing as experimentalists demand more and more metagenomic analysis runs. Born out of this need, our MG-RAST (Metagenomics-Rapid Annotation) portal, an open-source, high-throughput metagenomics service, has been a major community resource since 2008, housing over 160K datasets and serving 40K users. However, since its original design, MG-RAST has witnessed the frenetic development of next-generation sequencing technologies, a drastically altered computing landscape (in both hardware and software), changed requirements in terms of the number of users and the volume and diversity of datasets, increasing complexity of pipeline components, and requirements for higher throughput. To adapt, MG-RAST has been continually modified. Modifications included upgrading the pipeline components with several algorithmic improvements; deploying a customized data and workflow management system (the SHOCK object store and the AWE workflow manager); and porting MG-RAST to a cloud-based distributed architecture. Notwithstanding our continual, albeit ad hoc, system improvements, our pilot studies have indicated the need for a comprehensive redesign of MG-RAST to keep pace with the needs of the rapidly advancing field of metagenomics.
Our proposed enhancements are based on expressed user requirements, new usage patterns, and the flexibility to incorporate new tools, especially for the compute-intensive similarity analysis of queried sequences. Through this project, we propose to accomplish MG-RAST's transformation by (i) improving its functionality and data reproducibility; (ii) improving its software quality and performance through automated monitoring and generation of test suites; and (iii) moving toward a federated infrastructure for metagenomics data. Overall, the successful accomplishment of our aims will support alternate metagenomics service models through federation of services and data and result in a robust, state-of-the-art metagenomics resource. Federation in biomedical pipelines is, in general, a powerful direction for leveraging the expertise of diverse user bases and, reciprocally, benefiting those users. Thus, MG-RAST, as a state-of-the-art pipeline, will be capable of supporting an ever-increasing user base, handling larger and more varied datasets, and evolving in concert with new genomics technologies. The ultimate goal is to accelerate advances in end-user applications, e.g., personalized medicine tailored to the patient's microbiome.

Public Health Relevance

Analysis of metagenomic data, i.e., genetic material recovered from environmental samples, has tremendous potential for advances in diverse clinical and ecological applications. We have built and maintained since 2008 an open portal for metagenomic data processing called MG-RAST. Here we present a plan for continued development and maintenance of MG-RAST to support a larger and more diverse user-base and a greater diversity of datasets and processing tools in the pipeline.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Allergy and Infectious Diseases (NIAID)
Type
Research Project (R01)
Project #
5R01AI123037-05
Application #
9906157
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Brown, Liliana L
Project Start
2016-03-01
Project End
2021-02-28
Budget Start
2020-03-01
Budget End
2021-02-28
Support Year
5
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Purdue University
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
072051394
City
West Lafayette
State
IN
Country
United States
Zip Code
47907
Ghoshal, Asish; Zhang, Jinyi; Roth, Michael A et al. (2018) A Distributed Classifier for MicroRNA Target Prediction with Validation Through TCGA Expression Data. IEEE/ACM Trans Comput Biol Bioinform 15:1037-1051
Ten Hoopen, Petra; Finn, Robert D; Bongo, Lars Ailo et al. (2017) The metagenomic data life-cycle: standards and best practices. Gigascience 6:1-11
Bowers, Robert M; Kyrpides, Nikos C; Stepanauskas, Ramunas et al. (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35:725-731
Thompson, Luke R; Sanders, Jon G; McDonald, Daniel et al. (2017) A communal catalogue reveals Earth's multiscale microbial diversity. Nature 551:457-463
Meyer, Folker; Bagchi, Saurabh; Chaterji, Somali et al. (2017) MG-RAST version 4: lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Brief Bioinform
Chaterji, Somali; Ahn, Eun Hyun; Kim, Deok-Ho (2017) CRISPR Genome Engineering for Human Pluripotent Stem Cell Research. Theranostics 7:4445-4469
Kim, Seong Gon; Harwani, Mrudul; Grama, Ananth et al. (2016) EP-DNN: A Deep Neural Network-Based Global Enhancer Prediction Algorithm. Sci Rep 6:38433
Wilke, Andreas; Bischof, Jared; Gerlach, Wolfgang et al. (2016) The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res 44:D590-4
Magner, Abram; Duda, Jarosław; Szpankowski, Wojciech et al. (2016) Fundamental Bounds for Sequence Reconstruction from Nanopore Sequencers. IEEE Trans Mol Biol Multiscale Commun 2:92-106