Current techniques for answering questions about the influence of behavioral and environmental factors on public health are based on surveys, which are costly and subject to response bias, or simulations, which rely on possibly incorrect or simplistic assumptions. The TwitterHealth project is developing techniques to extract reliable public health information from social media. In essence, the online population becomes a vast organic sensor network. Statistical natural language processing techniques are employed to classify tweets (or other social media postings) as self-reports of disease or of particular behaviors of interest. GPS information included in postings made from cell phones allows a variety of behavioral information to be inferred about each user, such as the venues visited and the other individuals in the data set who are encountered.
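As an illustration of how encounters between users might be inferred from GPS-tagged postings, the following is a minimal sketch and not the project's actual method. It assumes each posting has been reduced to a (user, timestamp, latitude, longitude) tuple, and it treats two users as having encountered each other when they post within a chosen distance and time gap of one another. The function names, the 100-meter radius, and the one-hour window are all hypothetical choices made for this example.

```python
from datetime import datetime, timedelta
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def find_encounters(posts, max_dist_m=100.0, max_gap=timedelta(hours=1)):
    """Return the set of user pairs who posted close together in space and time.

    posts: iterable of (user, timestamp, lat, lon) tuples (an assumed format).
    The thresholds are illustrative defaults, not values from the project.
    """
    encounters = set()
    # Pairwise comparison is quadratic; fine for a sketch, not for real data volumes.
    for (u1, t1, la1, lo1), (u2, t2, la2, lo2) in combinations(list(posts), 2):
        if u1 == u2:
            continue
        close_in_time = abs(t1 - t2) <= max_gap
        if close_in_time and haversine_m(la1, lo1, la2, lo2) <= max_dist_m:
            encounters.add(frozenset((u1, u2)))
    return encounters


# Made-up sample postings for demonstration only.
sample_posts = [
    ("alice", datetime(2013, 9, 1, 12, 0), 40.7580, -73.9855),
    ("bob", datetime(2013, 9, 1, 12, 10), 40.7581, -73.9856),
    ("carol", datetime(2013, 9, 1, 12, 5), 43.1566, -77.6088),
]
encounters = find_encounters(sample_posts)
```

A realistic system would likely replace the pairwise loop with a spatial index or grid bucketing, since the number of pairs grows quadratically with the number of postings.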

Major technical challenges for using social media in this manner are the highly noisy nature of the information channel, scaling to a large number of different health conditions, and the need to discover causal influences as well as correlations between behavioral and environmental factors and health. The challenge of noise is approached by learning dynamic relational models of health states, which generalize classical epidemiological models but support individual as well as aggregate predictions. The scaling challenge is dealt with by knowledge transfer techniques, which reduce data and computational requirements by transferring information between models for different health conditions. Specific knowledge transfer techniques include cascaded training of a target classifier starting with a given classifier for a related but different disease, and the use of ensembles of general and specific classifiers. The challenge of inferring causality is addressed by temporal-lag methods, which identify changes in behavioral or environmental conditions that consistently precede changes in health. For example, the inference that a venue is a cause (vector) of disease spread is accomplished by tracing backward in time the GPS trails of users who post social media reports of illness. TwitterHealth employs two approaches for validating its results: first, comparing the aggregate predictions of the model against CDC statistics; second, comparing individuals' behavior in reporting or not reporting disease symptoms in status updates against the behavior predicted by the models. The project also includes planning for clinic-based evaluations, in which subjects identified by their social media postings would provide swabs that would be tested for disease agents.
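The venue trace-back idea described above can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the project's temporal-lag method: it assumes venue visits are already available as (user, timestamp, venue) tuples and that each ill user's first illness report has a known timestamp. The function name, the data layout, and the three-day lag window are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def venues_before_illness(visits, illness_reports, lag=timedelta(days=3)):
    """Count, per venue, the distinct users who visited shortly before reporting illness.

    visits: iterable of (user, timestamp, venue) tuples (an assumed format).
    illness_reports: dict mapping user -> timestamp of first illness report.
    lag: how far back in time to trace each ill user's trail (illustrative value).
    """
    ill_user_venue_pairs = set()
    for user, when, venue in visits:
        reported = illness_reports.get(user)
        if reported is None:
            continue  # this user never reported illness
        # Keep only visits inside the window [reported - lag, reported).
        if reported - lag <= when < reported:
            ill_user_venue_pairs.add((user, venue))
    # Each user counts at most once per venue.
    return Counter(venue for _, venue in ill_user_venue_pairs)


# Made-up sample data for demonstration only.
sample_visits = [
    ("u1", datetime(2013, 9, 1, 12), "cafe"),
    ("u1", datetime(2013, 9, 1, 18), "gym"),
    ("u2", datetime(2013, 9, 2, 9), "cafe"),
    ("u2", datetime(2013, 8, 20, 9), "gym"),  # too early to fall in the window
    ("u3", datetime(2013, 9, 2, 9), "cafe"),  # u3 never reported illness
]
sample_reports = {
    "u1": datetime(2013, 9, 3, 8),
    "u2": datetime(2013, 9, 4, 8),
}
venue_counts = venues_before_illness(sample_visits, sample_reports)
```

Raw counts like these only rank candidate venues. A popular venue will score highly regardless of any role in disease spread, so an actual analysis would need to compare against the venue's baseline visit rate among users who did not report illness.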

The TwitterHealth approach to collecting and analyzing health information has the potential to improve public health by making detailed data about health, behavior, social structure, and geographic influences available in real time and at almost no cost. While it will not completely replace traditional methods of gathering health information, it provides an important complementary information channel, which emphasizes speed, reach, and scale. The project includes outreach to expert medical professionals in order to plan future clinical validation. The outreach interaction provides a forum for the exchange of computer science and medical expertise between researchers and students in the two fields. Information about the project is available online at www.cs.rochester.edu/u/kautz/twitterhealth.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1319378
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2013-09-01
Budget End: 2017-08-31
Support Year:
Fiscal Year: 2013
Total Cost: $497,939
Indirect Cost:
Name: University of Rochester
Department:
Type:
DUNS #:
City: Rochester
State: NY
Country: United States
Zip Code: 14627