The diversity of prokaryotic microbes on the planet is very large, estimated at over a billion species of bacteria, and most of it remains undiscovered. Because genome sequencing can help characterize this diversity and has recently become routine, most microbial scientists have been overwhelmed by the amount of available data. For many researchers, a complete and thorough analysis of the available genome data is not necessary. Instead, these users would be better served by straightforward methods that can identify their unknown DNA/RNA sequence(s) and discriminate between well-understood taxa and those that are potentially novel. Tools that can help direct researchers to the most "interesting" genomes and genes among thousands of candidates will be important. Furthermore, metagenomics, the sequencing of environmental DNA, allows the genetic characterization of the majority of prokaryotes that have resisted cultivation in the laboratory. However, current tools to analyze metagenomic data clearly lag behind the development of sequencing technologies (and data processing tools), and are typically limited to genome assembly and gene annotation. This is a major limitation for understanding, studying, and communicating about the biodiversity of uncultivated microorganisms that run the life-sustaining biogeochemical cycles on the planet, form critical associations with their plant and animal hosts, or produce products of biotechnological value. Therefore, new approaches that make the emerging genomic and metagenomic sequence information readily available to the non-expert user are timely and essential for advancing our understanding of the diversity and function of microbial communities across the fields of ecology, systematics, evolution, engineering, agriculture and medicine.

The overarching goal of this project is to develop a computational framework to organize and catalogue the genomic and metagenomic information of the "uncultivated majority" of bacteria and archaea, one that will scale up with the available and forthcoming metagenomic data sets. To achieve this goal, the Microbial Genomes Atlas project (MiGA), which was originally designed to catalogue the diversity among isolate genomes, will be used and expanded with new tools and computational solutions that will allow it to encompass the "uncultivated majority". The new tools include, but are not limited to, a k-mer-based approach to replace the alignment-based genome-average nucleotide identity (ANI) metric for estimating genetic relatedness among genomes, as well as high-throughput tools for assessing genome completeness and quality, the pangenome of a group of genomes, and the level of genetic variation within natural sequence-discrete populations. This research effort will address critical limitations of the current 16S rRNA gene-based approaches to cataloguing microbial diversity (e.g., lack of resolution at the species level and below), and will provide much-needed tools for the online analysis of high-throughput genomic and metagenomic data from uncultivated microorganisms. The project will also provide multifaceted learning experiences to both undergraduate and graduate students at the interface of microbiology, evolution, genomics, bioinformatics, and computational biology, a pivotal area of contemporary research and education. See MiGA's website for further details: www.microbial-genomes.org.
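To illustrate what a k-mer-based alternative to alignment-based relatedness estimates can look like in general, here is a minimal sketch that compares two sequences by the Jaccard similarity of their k-mer sets. It is not MiGA's method and does not compute ANI; the function names `kmers` and `jaccard_similarity`, the default `k`, and the toy sequences are all hypothetical choices made for this example. Real tools typically also handle reverse complements and use much larger k with sketching, which this sketch omits.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of length-k substrings of seq made up only of A, C, G, T.

    Assumption for this sketch: only the forward strand is considered;
    reverse complements are NOT collapsed into canonical k-mers.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet  # skip k-mers with N or other symbols
    }


def jaccard_similarity(seq_a: str, seq_b: str, k: int = 4) -> float:
    """Jaccard similarity of the two sequences' k-mer sets, in [0, 1]."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        # Neither sequence yields a valid k-mer; report no similarity.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Toy sequences (hypothetical): they share 2 of 4 distinct 3-mers.
    print(jaccard_similarity("ACGTA", "CGTAC", k=3))
```

A higher value means the two sequences share a larger fraction of their distinct k-mers. Because no alignment is computed, the comparison reduces to set operations, which is the property that makes k-mer approaches attractive for large collections of genomes.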

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1759831
Program Officer: Peter McCartney
Project Start:
Project End:
Budget Start: 2018-09-01
Budget End: 2021-08-31
Support Year:
Fiscal Year: 2017
Total Cost: $598,895
Indirect Cost:
Name: Georgia Tech Research Corporation
Department:
Type:
DUNS #:
City: Atlanta
State: GA
Country: United States
Zip Code: 30332