The microbiome, which plays an important role in human health and disease, is generally characterized using high-throughput genome sequencing. However, the laboratory processes required for microbial metagenomic sequencing can introduce spurious measurement noise due to, for example, DNA extraction, amplification, sequencing depth, GC bias, batch effects, laboratory protocols, and bioinformatics processing. Without correction, the magnitude of sample- and study-specific variation can easily exceed the magnitude of variation due to treatment or disease status. Therefore, diagnosis and treatment of diseases and infections based on microbial sequencing are impeded by spurious noise that masks true biological signal. The overall goal of this research is to develop new statistical methods for the analysis of microbiome data, including taxonomic, functional, and metabolic data. Our statistical models will explicitly model batch and technical variation, allowing us to distinguish, rather than conflate, biological signal and non-biological noise. Our new models will leverage commonly collected sequence data, such as positive controls and technical replicates, which researchers do not typically use in their statistical analysis of microbiome data. By designing statistical methods that use existing data sources, we will reduce the amount and cost of sequencing required to detect true biological signals. Our models will allow us to perform hypothesis testing for differential abundance of microbial genes, strains, and metabolites, as well as shifts in the diversity of microbial communities, without discarding biological signal or detecting spurious technical noise due to imperfect laboratory protocols and instrumentation.
The methods are applicable to a broad range of experimental designs (including observational and longitudinal), biomedical research methods (including model systems and clinical trials), and sequencing platforms (including marker gene and whole genome sequencing as well as spectrometric methods for metabolic and proteomic profiling). Our statistical methods will be distributed as freely available, open-source software, which will include extensive tutorials and forums for user questions. By avoiding detection of signals due to sample- and study-specific artefacts, our methods will increase the reproducibility of microbiome research and facilitate the identification of therapeutic and diagnostic opportunities in microbiome science.

Public Health Relevance

The human microbiome, which plays an important role in many diseases, is generally characterized using high-throughput genome sequencing, which can introduce measurement noise due to sequencing depth, batch effects, and laboratory protocols. The overall goal of this research is to develop new statistical methods and software that explicitly model batch and technical variation, allowing us to distinguish, rather than conflate, biological signal and non-biological noise. These methods will enable biomedical scientists to increase the reproducibility of microbiome research, facilitating the identification of the specific biological elements within the microbiome that influence human health.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Unknown (R35)
Project #
5R35GM133420-02
Application #
10000959
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Brazhnik, Paul
Project Start
2019-09-01
Project End
2024-06-30
Budget Start
2020-07-01
Budget End
2021-06-30
Support Year
2
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Washington
Department
Biostatistics & Other Math Sci
Type
Schools of Public Health
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195