The J. Craig Venter Institute is awarded a grant to develop MGTAXA, freely available software and a Web server for taxonomic classification of metagenomic sequences with machine learning techniques. This project will build three major components: 1) a toolbox for reliable assignment of species composition to large collections of unassembled environmental sequencing data, with automated and regular updates of databases and models; 2) a public Web server with a high-performance computational back-end that will let a wide community of biologists build classification models specific to their metagenomic samples; 3) an online instructional environment where students and educators will interactively combine several machine learning algorithms into graphically represented pipelines, apply them to sequences from annotated genomes, and contribute to a reusable repository of exercises and small research projects. The tools developed by this project will help both individual biologists and experienced bioinformatics teams analyze their metagenomic data for the discovery of novel genes, proteins, and metabolic pathways in microorganisms that cannot be grown under laboratory conditions. This basic scientific research of our living environment will ultimately benefit the public by providing a necessary foundation for applied areas of study such as alternative energy sources and new medicines. The first task of any metagenomic study is to determine what species or higher taxonomic units are present in the sample and to bin individual sequences into these units. The novel methodology of this project will require neither existing homology to known sequences nor a preliminary assembly of individual fragments into longer segments. It will also free its users from data management and installation complexity that is beyond the abilities of smaller research groups.
The free interactive online learning interface will provide both hands-on experience and a curriculum development tool for students and teachers from colleges and high schools, regardless of their geographical location. Source code of the tools developed by this project will be available at the open source development site SourceForge (http://sourceforge.net/projects/mgtaxa/). Web services will be available through the JCVI Web site (www.jcvi.org/) and the TeraGrid (www.teragrid.org). Certain tools will be submitted for inclusion into the existing bioinformatics services Galaxy (http://galaxy.psu.edu) and CAMERA (http://camera.calit2.net).

Project Report

This project has developed MGTAXA, freely available software and a Web server for predicting what organisms are present in an environment. The user uploads fragments of genomic sequences obtained by directly sampling a specific environment such as marine water or the human gut. The server compares the uncharacterized user sequences with a database of sequences from known organisms and labels the input sequences.

The project addressed the following intellectual merits: A novel method was created for predicting which viruses infect which microbes (host range). Viruses that infect bacteria are called bacteriophages. Such viruses control microbial populations and play a major role in the circulation of organic matter in the oceans, ultimately influencing global food production and carbon emissions. They also influence microbial populations in the human microbiome, soil and elsewhere. This project has created the first method that can predict the hosts for short fragments of bacteriophage sequences based on overall similarity with database bacterial sequences. The MGTAXA software that this project has developed predicts both the names of the organisms for all domains of life and the names of viral hosts. The software can run in parallel on thousands of processors to handle large input datasets quickly. It can run on multiple hardware architectures, including the high-performance parallel supercomputers of NSF XSEDE (formerly TeraGrid). The Web server at the J. Craig Venter Institute is built on the widely used Galaxy bioinformatics workbench. It provides a flexible interface to the MGTAXA tools. Users can supplement the database sequences with their own reference data before classifying their fragments. Users can also build workflows from the provided tools. The project has applied its methods to study large environmental datasets generated at the Venter Institute, using both local computational resources and NSF XSEDE.
The datasets ranged from the largest survey of marine microbes to human microbiome datasets.

The broader impacts of the project: The tools developed by this project help biologists analyze their environmental genomic data for the discovery of novel genes, proteins, and metabolic pathways in microorganisms that cannot be grown under laboratory conditions. This basic scientific research of our living environment ultimately benefits the public by providing a necessary foundation for applied studies of geochemical cycles, food production, and the association between microbial flora and health. Using the interactive Web server environment it developed, the project trained a high school student intern in the design and application of statistical tools for analyzing associations between environmental sequences and aquatic conditions such as water temperature, salinity and oxygen concentration.

Software products: The source code of all tools developed by the project is available at the open source development sites GitHub and BitBucket, which are referenced on the project Web site: http://andreyto.github.com/mgtaxa/.
The Web server is available at: http://mgtaxa.jcvi.org/

The specific software products are:
MGTAXA viral host prediction and metagenomic classification algorithms: http://andreyto.github.com/mgtaxa/
NCBI BLAST+ parallelized with MapReduce MPI for HPC clusters: http://github.com/andreyto/mr-mpi-blast
Self-Organizing Maps algorithm parallelized with MapReduce MPI for HPC clusters: http://github.com/andreyto/mr-mpi-som
Source code for the customized distribution of the Galaxy bioinformatics workbench used by the MGTAXA Web front-end: https://bitbucket.org/andreyto/mgtaxa-galaxy
Firewall-friendly connector between the Galaxy workbench and the GRIDWAY metascheduler: https://bitbucket.org/andreyto/gridway-proxy-mad
JavaScript and HTML tool for presenting the predicted taxonomic composition of multiple metagenomic samples on a zoomable geographic map, along with environmental variables associated with the samples: https://bitbucket.org/andreyto/mgtaxa-chart-map
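
The report describes labeling short fragments by their overall similarity to database sequences, without requiring homology or assembly. As a rough illustration of that general idea only, the sketch below compares the k-mer composition of a fragment with that of each reference sequence and picks the closest match. It is not the MGTAXA algorithm: the cosine similarity measure, the value of k, the function names (kmer_profile, cosine, classify) and the toy reference sequences are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    k-mers containing other characters (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(fragment, references, k=3):
    """Return the label of the reference whose k-mer profile is most similar."""
    frag = kmer_profile(fragment, k)
    scores = {
        label: cosine(frag, kmer_profile(ref, k))
        for label, ref in references.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy references; real reference data would be whole genomes.
references = {
    "AT_rich_toy": "ATATTAATTTAAATATTATAATTTATAAATTA" * 4,
    "GC_rich_toy": "GCGGCCGCGCCGGGCCGCGGCGCCCGGCGCGC" * 4,
}

if __name__ == "__main__":
    print(classify("TTATAATATTTAAT", references))
    print(classify("GCCGCGGGCGCC", references))
```

A composition-based comparison of this kind needs no alignment to a known gene, which is why it can in principle be applied to unassembled fragments. A production classifier would use statistical models trained on far larger reference collections.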

Agency
National Science Foundation (NSF)
Institute
Division of Biological Infrastructure (DBI)
Application #
0850256
Program Officer
Peter H. McCartney
Project Start
Project End
Budget Start
2009-04-15
Budget End
2012-07-31
Support Year
Fiscal Year
2008
Total Cost
$809,384
Indirect Cost
Name
J. Craig Venter Institute, Inc.
Department
Type
DUNS #
City
Rockville
State
MD
Country
United States
Zip Code
20850