In modern biomedical studies it has become commonplace to collect high-dimensional data, so dimensionality reduction tools are critically important and routinely used. Among the most common are clustering and factor analysis. The basic tenet of dimensionality reduction is that a high-dimensional set of variables can be replaced by a low-dimensional summary. This is necessary both to make sense of complex data and to overcome the problems posed by high-dimensional, low-sample-size data. However, one critical issue that has not been adequately studied is reproducibility. Standard approaches to dimension reduction can be very sensitive to tuning parameters and to arbitrary choices (e.g., the choice of kernel or distance measure). This leads to a lack of robustness, with potentially very different results when the data are slightly perturbed, and the problem tends to be compounded as the size of the data increases, in terms of both the sample size and the number of variables collected. A second critical issue is lack of generalizability: dimensionality reduction performed for one group of individuals may fundamentally fail to generalize to other groups, which creates major problems in interpreting results. Motivated in particular by environmental epidemiology studies collecting exposome data and by nutritional epidemiology, this project proposes to develop fundamentally new methods for improving the robustness and reproducibility of dimensionality reduction through the following specific aims.
(1) Develop robust methods of factor analysis designed to limit sensitivity to arbitrary assumptions and size of the data.
(2) Develop robust methods of model-based clustering designed to limit sensitivity to arbitrary assumptions and size of the data.
(3) Develop novel methods for robust clustering from multivariate and grouped data designed to avoid typical pitfalls of mixture models with increasing p.
(4) Develop robust consensus methods that estimate low-dimensional summaries that best reflect structure across subpopulations.
(5) Apply the proposed methods to data from key epidemiologic cohorts that have measured a wide variety of environmental, behavioral, and biological exposures, and provide a general-use software package for implementation. This package is designed to be easy to use and to accommodate a broad variety of data types, further aiding reproducibility and transparency.
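The sensitivity to small perturbations described above can be probed with a simple stability check. The following is a minimal sketch and not the project's proposed methodology: it assumes a plain k-means clustering with a deterministic farthest-point initialization, jitters the data with Gaussian noise, and scores agreement between the original and perturbed clusterings with the Rand index. All function names (`kmeans_labels`, `rand_index`, `stability`) and parameters are hypothetical, chosen only for illustration.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point.

    Assumption: centers are initialized deterministically (first point,
    then repeatedly the point farthest from all chosen centers).
    """
    centers = [tuple(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(tuple(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, noise_sd, trials=20, seed=0):
    """Mean Rand index between the clustering of the original data and
    clusterings of copies jittered with Gaussian noise of sd noise_sd."""
    rng = random.Random(seed)
    base = kmeans_labels(points, k)
    scores = []
    for _ in range(trials):
        jittered = [tuple(x + rng.gauss(0.0, noise_sd) for x in p)
                    for p in points]
        scores.append(rand_index(base, kmeans_labels(jittered, k)))
    return sum(scores) / trials
```

A score near 1 suggests the clustering is insensitive to perturbations of that size; a markedly lower score flags the kind of non-robustness the aims are meant to address. The same check could be repeated across distance measures or numbers of clusters to see how much those arbitrary choices matter.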

Public Health Relevance

High-dimensional data on important environmental, behavioral, biological, sociodemographic, and physical factors are routinely collected as part of scientific studies. These data are often interrelated in important ways; for example, an individual who makes healthy dietary choices may also be more likely to exercise and less likely to engage in risky behaviors. Dimension reduction methods are often used in analysis to facilitate interpretable inference. This project focuses on new statistical methods needed to improve the reproducibility and generalizability of dimension reduction within and across different populations of interest.

National Institutes of Health (NIH)
National Institute of Environmental Health Sciences (NIEHS)
Research Project (R01)
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Shreffler, Carol A
Duke University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States