Two challenges in analyzing health effects of multi-pollutant, long-term air pollution exposure are: (i) interpreting parameters in health effect regressions (requiring dimension reduction by principal component analysis, clustering, etc.) and (ii) spatial misalignment of exposure data. Spatial misalignment refers to the situation where exposure data are not available at locations where subjects live, so exposures need to be estimated using a spatial prediction model based on monitoring data from different locations. This first-stage exposure modeling typically combines regression on geographic covariates with spatial smoothing. Predictions from the first-stage model are then used in a health effect regression model. Extending this paradigm to multi-pollutant studies requires generalizing spatial prediction methods to multivariate exposure vectors, ideally using a method that is synergistic with (or at least compatible with) the dimension reduction for health effect analyses. We propose two novel methods: predictive sparse principal component analysis and predictive k-means clustering. Our methods seek to find sparse principal component loadings and k-means cluster centers that explain a large proportion of the variability in the data while ensuring the corresponding low-dimensional representations are predictable at subject locations. Predictions of these lower-dimensional quantities can be used in health effect regressions. Our approach is preferable to a sequential approach (dimension reduction followed by spatial prediction), which may result in representations that are difficult to predict at subject locations. We illustrate the practical utility of our methods by applying them to national monitoring data from EPA regulatory networks and epidemiologic analyses of the Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air).
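
For orientation, the sequential approach that the abstract argues against can be sketched as below. This is a minimal illustration under stated assumptions, not the proposed predictive methods and not the investigators' code: the function name `sequential_baseline`, the simulated data, and the use of ordinary least squares on geographic covariates (with the spatial smoothing step omitted) are all hypothetical simplifications.

```python
import numpy as np

def sequential_baseline(X_mon, Z_mon, Z_subj, n_components=2):
    """Sequential approach: ordinary PCA at monitors, then prediction of scores.

    X_mon  : (n_mon, p) pollutant concentrations at monitor locations
    Z_mon  : (n_mon, q) geographic covariates at monitor locations
    Z_subj : (n_subj, q) geographic covariates at subject locations
    """
    # Step 1: dimension reduction on monitor data only (ordinary PCA via SVD).
    Xc = X_mon - X_mon.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T           # (p, k) principal component loadings
    scores = Xc @ loadings                   # (n_mon, k) scores at monitors

    # Step 2: first-stage exposure model for the scores.
    # Assumption: plain least squares on covariates; spatial smoothing is omitted.
    A_mon = np.column_stack([np.ones(len(Z_mon)), Z_mon])
    beta, *_ = np.linalg.lstsq(A_mon, scores, rcond=None)

    # Step 3: predict the low-dimensional scores at subject locations.
    A_subj = np.column_stack([np.ones(len(Z_subj)), Z_subj])
    pred = A_subj @ beta

    # In-sample R^2 per component: a crude indicator of how predictable
    # each component is from the geographic covariates.
    resid = scores - A_mon @ beta
    total = scores - scores.mean(axis=0)
    r2 = 1.0 - (resid ** 2).sum(axis=0) / (total ** 2).sum(axis=0)
    return loadings, pred, r2

# Simulated example data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
Z_mon = rng.normal(size=(200, 3))
Z_subj = rng.normal(size=(50, 3))
B = rng.normal(size=(3, 5))
X_mon = Z_mon @ B + rng.normal(scale=0.5, size=(200, 5))

loadings, pred, r2 = sequential_baseline(X_mon, Z_mon, Z_subj)
print(pred.shape, r2.round(2))
```

Because the loadings in step 1 are chosen without regard to step 2, nothing guarantees that the resulting scores are well predicted by the covariates. That is the gap the predictive variants described in the abstract aim to close.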

Public Health Relevance

The methods developed here will substantially improve our ability to attribute observed health effects of air pollution exposure to specific classes or mixtures of pollutants. The enhanced epidemiologic findings that result will enable us to hypothesize specific biological pathways, to identify genetic or behavioral risk modifiers, and to guide development of more precisely targeted regulatory policies. While the increased disease risk from air pollution is small for any single individual, the public health implications are significant due to the large number of people exposed and the ability of governments to mitigate exposure through regulatory action.

National Institutes of Health (NIH)
National Institute of Environmental Health Sciences (NIEHS)
Exploratory/Developmental Grants (R21)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Thompson, Claudia L
University of Washington
Biostatistics & Other Math Sci
Schools of Public Health
United States
Keller, Joshua P; Drton, Mathias; Larson, Timothy et al. (2017) Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts. Ann Appl Stat 11:93-113
Jandarov, Roman A; Sheppard, Lianne A; Sampson, Paul D et al. (2017) A novel principal component analysis for spatially misaligned multivariate air pollution data. J R Stat Soc Ser C Appl Stat 66:3-28
Keller, Joshua P; Chang, Howard H; Strickland, Matthew J et al. (2017) Measurement Error Correction for Predicted Spatiotemporal Air Pollution Exposures. Epidemiology 28:338-345