Almost every environment on Earth is home to communities of microbes, whose metabolism profoundly influences all other life. Understanding such communities is therefore important to many fields, including agriculture, biogeochemistry, oceanography, biology, and ecology. However, the overwhelming majority of environmental microbes have never been isolated and grown in the laboratory. A key to understanding microbial communities is therefore the resolution of genomes from culture-independent sequencing (metagenomics), where the genomes of individual species are reconstructed from a mixed metagenome in a process called "binning". Inaccurate binning hampers the ability to gain fundamental insights into microbial communities, for example when it is uncertain whether all of a particular genome has been recovered from the dataset, or whether all sequences in a bin really come from the same genome. Existing binning methods have various problems, including (1) poor performance with highly complex metagenomes, (2) the assumption that all input sequences are bacterial, leading to impure bins, (3) a lack of internal validation that bins make biological sense, and (4) a failure to account for multiple closely related strains and genome variability within a species (the "pangenome" concept). The goal of this project is to develop a binning method that can handle the novelty and complexity of even the most diverse microbial communities on Earth, such as those in soil. The methods developed have the potential to transform the study of microbial communities, enabling the determination of "who is doing what?" even in highly complex communities associated with higher organisms with unknown genomes.
With increased understanding of microbial communities through genome-resolved metagenomics, it will eventually be possible to model and predict the complex emergent behavior of communities and to delineate how their capabilities differ from those of the component organisms in isolation. This project also aims to address the current shortfall of STEM graduates in the United States, where many students initially express interest in STEM fields but are then lost to other majors. Research experiences have been shown to help sustain interest in STEM, but they often reach only a few students and do not necessarily give all students equitable access. To address these problems, this project will establish a course-based undergraduate research experience (CURE) in the analysis of metagenomics data, aimed at undeclared majors.

In developing improved binning methods, the research will focus on (1) developing algorithms that leverage taxonomy information and universally conserved marker genes to bin both well-characterized and divergent species in single samples, and (2) developing pangenome-aware algorithms that leverage the co-occurrence of the same species in multiple samples for binning. This approach will maximize the information that can be gleaned from single samples, while avoiding assumptions about genome conservation within and between samples when leveraging data from multiple samples. The developed methods will be validated using simulated metagenomes as well as real data from soil. Specifically, soil communities will be followed longitudinally after forest fires, where most microbes are killed and diversity then slowly returns to the baseline. In addition to validating the performance of binning methods, these data will be used to investigate various hypotheses on why soil is able to maintain such high levels of microbial diversity. The CURE course designed and implemented during this project will aim to develop skills in "big data" analysis, which are becoming increasingly important in all branches of science, and will expose students to the analysis of shotgun metagenome data. The course will synergize with the highly popular Tiny Earth course, which is currently offered to nearly 10,000 students in 41 US states and 14 countries. The Tiny Earth network will act as a central repository for data that can be used by institutions that lack resources for shotgun sequencing. Over time this data repository will facilitate the exploration of "big questions", thus crowdsourcing student-led discovery.

For further information, visit http://jason-c-kwan.github.io/CAREER_results.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 1845890
Program Officer: Jean Gao
Budget Start: 2019-07-01
Budget End: 2024-06-30
Fiscal Year: 2018
Total Cost: $496,253
Name: University of Wisconsin Madison
City: Madison
State: WI
Country: United States
Zip Code: 53715