Advancing our knowledge of the extent and nature of biodiversity on Earth is a fundamental challenge in biology. The largest uncharacterized reservoir of phylogenetic diversity on the planet resides in cryptic lineages of Bacteria and Archaea, often referred to as "microbial dark matter", that cannot be cultivated in the laboratory and are known only from cultivation-independent molecular methods. Understanding the physiology, evolutionary history, and environmental impact of these groups is a major frontier for future research, and studies in this area have already led to important discoveries in biogeochemistry, evolutionary biology, and cellular physiology. Advances in high-throughput DNA sequencing, together with developments in the field of metagenomics, have provided an unprecedented amount of metagenomic data that can now be mined to discover novel microbial lineages, but new computational methods are required to leverage this "big data" and achieve these biological insights. This project focuses on developing computational tools to assess microbial diversity in metagenomic data through analysis of sequences of RNA polymerase (RNAP). RNAP is a high-resolution phylogenetic marker gene that provides accurate taxonomic assignments for Bacteria and Archaea and can be leveraged to identify and classify cryptic microbial lineages in the biosphere. This work will also involve the training of graduate and undergraduate students in computational biology and bioinformatics as well as the integration of these bioinformatic approaches into university curricula.

Advancing our knowledge of the extent and nature of biodiversity on Earth is a fundamental goal in biology, and currently the largest uncharacterized reservoir of phylogenetic diversity on the planet consists of cryptic lineages of Bacteria and Archaea that are often referred to as "microbial dark matter". Recent advances in high-throughput DNA sequencing and metagenomics provide an unprecedented amount of "-omic" data that can be used to assess the diversity and distributions of these lineages in the environment, but traditional approaches relying on small subunit rRNA genes are not compatible with these data due to complications that arise from sequence assembly. Given the massive quantity of metagenomic data currently available, there is an urgent need for the development of accurate and standardized bioinformatic workflows for the analysis of microbial diversity that are compatible with current metagenomic datasets. The goal of this work is to integrate research and teaching activities focused on developing software for the prediction and phylogenetic characterization of RNAP sequences in metagenomic datasets. This will be addressed by 1) development and validation of open-source software that uses custom Hidden Markov Models, gene neighborhood assessment, and machine learning algorithms to accurately predict RNAP genes from metagenomic data, and 2) compilation of an RNAP database for the fast and accurate phylogenetic classification of novel RNAP sequences. RNAP is the best available phylogenetic marker to use in metagenomic data, and this work will therefore provide a critical framework for the assessment of microbial phylogenetic diversity and evolutionary relationships. The impact of the proposed work will be maximized through the open-access distribution of software and data products on GitHub and the Virginia Tech library.
Implementation of the software will be integrated into graduate and undergraduate courses at Virginia Tech to increase student involvement and participation. Additionally, this research will be conducted by PhD and undergraduate students, several of whom are underrepresented minorities. The professional development of these students will be prioritized through close mentorship by the PIs and attendance at scientific conferences and career-building workshops. Results of this research will be provided on the Aylward Lab website (www.aylwardlab.com) and GitHub site (https://github.com/faylward).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1918271
Program Officer: Jean Gao
Project Start:
Project End:
Budget Start: 2019-08-01
Budget End: 2022-07-31
Support Year:
Fiscal Year: 2019
Total Cost: $555,496
Indirect Cost:
City: Blacksburg
State: VA
Country: United States
Zip Code: 24061