Living in highly structured multispecies communities, human-associated microbes engage in extensive cell-to-cell interactions that make the biofilm function. Quantifying the underlying dependence among microbes, as reflected in the joint distribution of organisms, is required to investigate the role of microorganisms in human health and disease. In genetic sequencing studies of oral microbiota, the primary outcome of interest is typically bacterial sequence counts. Biostatistical analysis of these counts requires overcoming three challenges. First, applying a separate taxon-by-taxon analysis to each of hundreds of organisms ignores the dependence structure and likely results in inflated type I error and lower power. A multivariate (joint endpoint analysis) method can model the multiple endpoints jointly, but it faces particular challenges in a high-dimensional setting such as whole-community microbiome analysis. Second, conventional regression models for count data are likely to fit microbiome data poorly, because many species are typically observed in only a few subjects ("zero inflation"). Finally, selecting which covariates to include for their association with microbial counts is difficult. To our knowledge, no comprehensive multivariate method has been developed to address these three challenges simultaneously. We have begun developing such a method and can now model the joint distribution of up to 20 count responses, such as counts of bacterial taxa. In simulation studies, the method performs much better than univariate methods.
The specific aim of this proposal is to scale up the method further so that it is applicable to oral microbiome data. Specifically, we will develop a multivariate regression method for the joint endpoint analysis of high-dimensional zero-inflated count data. The model builds upon a zero-inflated distribution that can naturally account for excess zeros. To address the high dimensionality of microbiome sequencing data, we will devise a flexible parametric covariance structure. Furthermore, we will develop a statistical framework to select a subset of covariates for each of multiple taxa simultaneously. The goal of this proposal is therefore to build a Bayesian multivariate zero-inflated regression model that accommodates high-dimensional microbial counts. We propose to develop and scale up (1) novel Bayesian multivariate regression methods, and (2) Bayesian multivariate variable selection methods. The study team includes microbiologists and has deep experience developing Bayesian methods for high-dimensional multivariate analysis. We will test the methods through analysis of simulated and real data, and we will make user-friendly software and tutorials freely available. The core innovation proposed is to resolve the challenges inherent to high-dimensional sequencing count data by developing multivariate regression and variable selection methods. We anticipate wide use of these novel methods and software in the fields of microbiology and biostatistics.
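To illustrate how a zero-inflated distribution accounts for excess zeros, the following is a minimal univariate sketch, not the proposed multivariate model. It assumes a zero-inflated Poisson form, which the proposal does not specify; the function names, the toy counts, and the parameter values (lam, pi) are hypothetical and chosen for illustration only.

```python
import math

def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson (an assumed form, for illustration).

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam). Requires lam > 0 and 0 <= pi < 1.
    """
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return pi * (y == 0) + (1.0 - pi) * poisson

def zip_loglik(counts, lam, pi):
    """Log-likelihood of independent counts under the zero-inflated Poisson."""
    return sum(math.log(zip_pmf(y, lam, pi)) for y in counts)

def poisson_loglik(counts, lam):
    """Ordinary Poisson log-likelihood: the special case pi = 0."""
    return zip_loglik(counts, lam, 0.0)

# Hypothetical counts for one taxon: absent in 8 of 10 subjects.
counts = [0] * 8 + [3, 5]

# Poisson evaluated at its maximum-likelihood rate (the sample mean).
mean_rate = sum(counts) / len(counts)
ll_poisson = poisson_loglik(counts, mean_rate)

# Zero-inflated Poisson at illustrative (not fitted) parameter values.
ll_zip = zip_loglik(counts, lam=4.0, pi=0.75)

print(f"Poisson log-likelihood: {ll_poisson:.2f}")
print(f"ZIP log-likelihood:     {ll_zip:.2f}")
```

On these toy counts the zero-inflated form attains a higher log-likelihood than the best-fitting Poisson, because a single Poisson rate cannot describe both the many zeros and the few large counts. The proposed method would additionally model dependence across taxa and covariate effects, which this sketch omits.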
Human-associated microbes live in multispecies communities, where they interact. The relationships among bacterial species should be incorporated in the analysis of their genetic information. We propose to create and share tools for these analyses, which can be used to investigate key bacterial associations that may serve as new targets for the prevention or treatment of disease.