The rapid development of high-throughput biological technologies has revolutionized the field of genomics. Genomic studies now produce molecular-level traits by measuring gene expression levels and characterizing covalent modifications of DNA and histone proteins. These molecular-level traits, including gene expression and methylation levels, are thought to mediate the effects of DNA and/or the environment on many traits and diseases, and hold the key to understanding the genetic and environmental basis of disease susceptibility and phenotypic variation. In particular, these high-dimensional biomarkers absorb and reflect environmental insults to the genome and serve as measures of an individual's internal molecular and cellular environment that change dynamically over the life course. In this project, statistical methods will be developed to perform high-dimensional mediation analysis, in order to further our understanding of the molecular basis of disease susceptibility and phenotypic variation and to facilitate the integrative analysis of molecular-level traits from omics studies. The proposed statistical methods will be used to study how the inherited DNA environment and the external environment, as measured through environmental toxicants, socioeconomic conditions, neighborhood characteristics, psychosocial stress, and other life events, influence omics measures of the internal environment and, in turn, lead to adverse health outcomes.
Technically, the molecular-level traits from omics studies will be treated as a multivariate set of high-dimensional mediators for integrative analysis. A novel high-dimensional mediation analysis framework will be developed to handle multiple exposures and multiple mediators simultaneously. The proposed methods will extend existing mediation analysis methods, which handle a univariate mediator and/or a univariate exposure, to the high-dimensional setting by making additional modeling assumptions on the effects of mediators and exposures to ensure model identifiability. While the problem is formulated in a causal inference framework, inference will be conducted using a Bayesian variable selection framework that identifies important exposures and mediators simultaneously. The research combines ideas from variance component score tests, Bayesian variable selection, and causal inference in a unified manner, with the aim of yielding new theoretical insights on the estimation of direct and indirect effects. Methodological extensions will also be made to conduct mediation analysis using shared summary statistics, which are becoming increasingly common in genetic studies. The newly developed methods will be applied to large ongoing cohort and case-control studies, and software will be developed for scalable implementation of the proposed methods.
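For orientation, the decomposition into direct and indirect effects that mediation analysis targets can be sketched on simulated data. The example below is only an illustration under stated assumptions: it uses a single exposure, five mediators, linear models, and ordinary least squares with the classical product-of-coefficients estimator. It is not the Bayesian variable selection framework proposed here, and all function names, effect sizes, and sample sizes are hypothetical.

```python
import numpy as np

# Hypothetical true effects, chosen only for this illustration.
ALPHA = np.array([0.8, 0.5, 0.0, 0.0, 0.0])  # exposure -> mediator effects
BETA = np.array([0.6, 0.0, 0.7, 0.0, 0.0])   # mediator -> outcome effects
GAMMA = 0.3                                  # direct effect of exposure on outcome


def simulate(n=2000, seed=0):
    """Simulate one exposure x, mediators m, and outcome y from linear models."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    m = np.outer(x, ALPHA) + rng.normal(size=(n, ALPHA.size))
    y = GAMMA * x + m @ BETA + rng.normal(size=n)
    return x, m, y


def ols(design, response):
    """Least-squares coefficients for response regressed on design."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


def mediation_effects(x, m, y):
    """Return (direct, per-mediator indirect, total) effect estimates."""
    ones = np.ones(len(x))
    # Mediator models: each mediator regressed on the exposure.
    alpha_hat = ols(np.column_stack([ones, x]), m)[1]
    # Outcome model: outcome regressed on the exposure and all mediators.
    coef = ols(np.column_stack([ones, x, m]), y)
    direct, beta_hat = coef[1], coef[2:]
    # Product-of-coefficients estimate of each mediator's indirect effect.
    indirect = alpha_hat * beta_hat
    # Total effect: outcome regressed on the exposure alone.
    total = ols(np.column_stack([ones, x]), y)[1]
    return direct, indirect, total


x, m, y = simulate()
direct, indirect, total = mediation_effects(x, m, y)
print("direct effect:   ", round(float(direct), 3))
print("indirect effects:", np.round(indirect, 3))
print("total effect:    ", round(float(total), 3))
```

In this simulation only the first mediator lies on a pathway from exposure to outcome, so only its indirect effect is nonzero in truth. With least squares, the total effect equals the direct effect plus the sum of the indirect effects. When the number of mediators approaches or exceeds the sample size, the outcome regression above is no longer identifiable without further assumptions, which is the setting the proposed methods are meant to address.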