Although invisible to the human eye, microorganisms have an immense impact on society and the environment. They keep soil healthy, cause disease, and give us tools like antibiotics to fight the diseases other microbes cause. Accurate and precise identification of microbes is thus essential for understanding microbiology in general, for diagnosis and treatment of diseases, and for maintaining a healthy society and a healthy environment. The DNA sequencing revolution allows us to read the genetic code of individual microbes and use it for fast and accurate identification. However, we cannot do this without reference databases that precisely define classes of microorganisms and associate them with their unique characteristics. We also need fast computer programs that can handle the large amounts of data involved and, to be most useful to the world, we need to allow anyone, anywhere to upload the genetic data for the microbes they find and quickly get an accurate identification and an assessment of their likely impact. Therefore, scientists will build genomeRxiv, a Web site and a database of hundreds of thousands of accurately catalogued and classified public genome sequences of bacteria and archaea. Building on existing work, genomeRxiv will let users query the database through a combination of fast and accurate algorithms. A unique feature will keep submitted genomes private, which will enable and stimulate networking and facilitate sharing of genome sequencing results among scientists across academia, industry, and government, leading to a more efficient and economically stimulating use of research funds. Automated design of diagnostic tools will facilitate detection and regulation of pathogens for biosafety and biosecurity and directly impact clinical and veterinary medicine, plant pathology, and the use of beneficial microbes in agriculture. The scientific community will be trained in the use of genomeRxiv, and undergraduate and graduate students of diverse backgrounds will receive education at the interface of biology and computer science.

The number of sequenced genomes is increasing exponentially, but automated assignment of taxonomic identity is constrained by the transfer of existing taxonomy. Up to ~20% of the existing classifications are expected to be incorrect, having been determined by historically contingent polyphasic tests that do not correspond to meaningful phylogenomic groupings identifiable at the genome level. The continued incorrect and inaccurate assignment of taxonomic identities undermines our understanding of prokaryote evolution and diversity, as well as legislative efforts to regulate and monitor pathogens, which rely on accurate identification. Tools already developed by the PIs of this project will be improved and integrated into a new computational service called genomeRxiv, which aims to solve these problems. The existing LINbase Web server will serve as the basis for the new genomeRxiv service. The highly resolved Life Identification Number™ (LIN™) classification framework of PIs Vinatzer and Heath will be combined with the speed and computational efficiency of the sourmash software developed by PI Brown and the precision and filtering ability of the pyani software developed by PI Pritchard. Primer design software developed by PI Pritchard will be integrated into genomeRxiv to give users the ability to quickly design precise molecular detection tools.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 2018911
Program Officer: Peter McCartney
Budget Start: 2020-08-01
Budget End: 2023-07-31
Fiscal Year: 2020
Total Cost: $297,294

Name: University of California Davis
City: Davis
State: CA
Country: United States
Zip Code: 95618