Supervised machine learning is a popular method that uses labeled training examples to predict future outcomes. Unfortunately, supervised machine learning for biomedical research is often limited by a lack of labeled data. Current methods for producing labeled data involve manual chart reviews that are laborious and do not scale with data creation rates. This project aims to develop a framework for crowdsourcing labeled data sets from electronic medical records by forming a crowd of clinical personnel labelers. The construction of these labeled data sets will enable new biomedical research studies that were previously infeasible to conduct.

Developing a crowdsourcing platform for clinical data poses several practical and theoretical challenges. First, popular public crowdsourcing platforms such as Amazon's Mechanical Turk are not suitable for medical record labeling because HIPAA makes clinical data sharing risky. Second, the types of clinical questions amenable to crowdsourcing are not well understood. Third, it is unclear whether a clinical crowd can produce labels quickly and accurately. Each of these challenges will be addressed in a separate Aim.

In the first Aim, the team will evaluate different clinical crowdsourcing architectures. The architecture must leverage the scale of the crowd while minimizing patient information exposure. De-identification tools will be considered to scrub clinical notes to reduce information leakage. Using this design, the team will extend a popular open-source crowdsourcing tool, PyBossa, and release it to the public. In the second Aim, the team will study the type, structure, topic, and specificity of clinical prediction questions, and how these characteristics affect label quality. In the third Aim, the team will evaluate the quality and accuracy of crowdsourced clinical labels on two existing chart review problems to determine the platform's utility.
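Evaluating the quality of crowdsourced labels typically begins by aggregating redundant labels collected for the same chart. A minimal majority-vote sketch is shown below; all function names, chart identifiers, and label values are hypothetical, and the proposal does not specify its aggregation method.

```python
from collections import Counter

def aggregate_labels(crowd_labels):
    """Majority-vote aggregation of redundant crowd labels.

    crowd_labels: dict mapping a chart identifier to the list of
    labels assigned by independent clinical labelers.
    Returns a dict mapping each chart identifier to a tuple of
    (consensus_label, agreement), where agreement is the fraction
    of labelers who voted with the majority -- a simple proxy for
    label confidence.
    """
    consensus = {}
    for chart_id, labels in crowd_labels.items():
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        consensus[chart_id] = (label, votes / len(labels))
    return consensus

# Example: three independent labelers per chart (hypothetical data)
votes = {
    "chart-001": ["positive", "positive", "negative"],
    "chart-002": ["negative", "negative", "negative"],
}
print(aggregate_labels(votes))
# chart-001 -> ("positive", 2/3); chart-002 -> ("negative", 1.0)
```

In practice, low-agreement charts would be routed back to additional labelers or to an expert adjudicator rather than accepted as training data.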

Public Health Relevance

Traditionally, clinical prediction models rely on supervised machine learning algorithms to probabilistically predict clinical events from labeled medical records. When data sets are small, manual chart reviews performed by clinical staff suffice to label each outcome; however, as data sets scale up and researchers aim to study larger cohorts, manual approaches become intractable. The goal of this proposal is to develop a framework for crowdsourcing labeled data sets from electronic medical records to support prediction model development.

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Exploratory/Developmental Cooperative Agreement Phase I (UH2)
Project #: 1UH2CA203708-01
Application #: 9076555
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Miller, David J
Project Start: 2016-05-06
Project End: 2018-04-30
Budget Start: 2016-05-06
Budget End: 2017-04-30
Support Year: 1
Fiscal Year: 2016
Total Cost:
Indirect Cost:
Name: Vanderbilt University Medical Center
Department:
Type:
DUNS #: 079917897
City: Nashville
State: TN
Country: United States
Zip Code: 37232