Large-Scale Nationally Representative Patient-generated Health Data for Development of Generalizable Data Science Methodologies for Precision Public Health

Racial-ethnic minorities, socioeconomically disadvantaged, and other underserved populations experience disproportionate adverse health outcomes despite decades of research correlating social determinants (SDs) to variations in health outcomes. Many public health approaches use population averages to create "one-size-fits-all" interventions to increase the probability of achieving the best outcomes for the average person, but are limited by population heterogeneity in number, magnitude, interplay, and amplification of SDs. Precision public health (PPH) emerged to use digital technologies (DTs) to develop interventions targeting unique needs of specific populations to improve health and reduce disparities. Analysis of voluminous, precise, continuous, and longitudinal data generated by DTs holds great promise for PPH as smartphones, Internet of Things, and wearable sensors are becoming ubiquitous, generating data on environment, transportation, geolocation, diet, exercise, social interactions, and daily activities. These person-generated health data (PGHD) have unprecedented potential to add rich insight on everyday human behaviors to traditional health research. Though clinical PGHD applications are in early stages, there is rapid progress in development of digital indicators of health, offering virtually limitless potential.

Because PGHD are typically captured outside of controlled research settings, they suffer from challenges of non-traditional data that impede their acceptance and use across the healthcare ecosystem. First, PGHD are vulnerable to input biases as users of consumer DTs are a self-selected group. Second, PGHD suffer from poor internal data quality due to high variability in completeness for reasons that are not always equally distributed across individuals (e.g., connectivity issues, battery, user forgetfulness, user error). Together, input bias and poor data quality lead to poor external validity, where analytics derived from PGHD are not generalizable to the broader population.

The objective of this partnership between the RAND Corporation and Evidation Health is to improve generalizability of data science methods for PGHD, allowing for representation of all population groups, including the historically underserved. We will accomplish this goal via three aims: (i) generate PGHD from a nationally representative probability sample of Americans to understand the social distribution of user engagement with health DTs and poor sleep health; (ii) develop a methodology that characterizes missing data within PGHD and selects appropriate imputation strategies (existing and novel) optimized for reduction in model bias and socio-demographic input disparities; and (iii) create a propensity-score-based statistical weighting methodology to improve the effectiveness and applicability of methods derived from non-random, self-selected, and/or already collected PGHD in underserved populations. This work will enable future identification and application of digital indicators for health interventions that account for all populations, a critical first step for digital PPH.

Public Health Relevance

Person-generated health data (PGHD), derived from consumer digital technologies such as smartphones and wearable sensors, are increasingly useful for the development of digital indicators of health that can be used by consumers and health professionals to monitor and maintain healthy behaviors in real time and longitudinally. Because PGHD are typically captured outside of controlled research/clinical settings, they suffer from challenges of non-traditional data (lack of representativeness, poor data quality and missing elements, and poor external validity) that impede their effectiveness when used in interventions among diverse populations, potentially exacerbating health disparities. Here, we propose the development of a set of tools that allow for digital health research to account for all populations, including the historically underserved: (i) a large, truly representative set of PGHD; (ii) imputation algorithms to handle various types of missing elements within those data to improve quality; and (iii) a propensity-score-based statistical weighting methodology that can be applied to existing non-representative PGHD to improve their generalizability/external validity.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
1R01LM013237-01A1
Application #
10052114
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Project Start
2020-07-01
Project End
2024-03-31
Budget Start
2020-07-01
Budget End
2021-03-31
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Southern California
Department
Miscellaneous
Type
Schools of Arts and Sciences
DUNS #
072933393
City
Los Angeles
State
CA
Country
United States
Zip Code
90089