Supervised machine learning is a widely used method that learns from labeled training examples to predict future outcomes. Unfortunately, supervised machine learning for biomedical research is often limited by a lack of labeled data. Current methods for producing labeled data rely on manual chart reviews that are laborious and do not scale with data creation rates. This project aims to develop a framework to crowdsource labeled data sets from electronic medical records by forming a crowd of clinical personnel labelers. The resulting labeled data sets will enable new biomedical research studies that were previously infeasible to conduct.

Developing a crowdsourcing platform for clinical data poses numerous practical and theoretical challenges. First, popular public crowdsourcing platforms such as Amazon's Mechanical Turk are not suitable for medical record labeling, as HIPAA makes clinical data sharing risky. Second, the types of clinical questions that are amenable to crowdsourcing are not well understood. Third, it is unclear whether the clinical crowd can produce labels quickly and accurately. Each of these challenges will be addressed in a separate Aim.

In the first Aim, the team will evaluate different clinical crowdsourcing architectures. The architecture must leverage the scale of the crowd while minimizing patient information exposure. De-identification tools will be considered to scrub clinical notes to reduce information leakage. Using this design, the team will extend a popular open-source crowdsourcing tool, Pybossa, and release it to the public. In the second Aim, the team will study the type, structure, topic, and specificity of clinical prediction questions, and how these characteristics affect the quality of the labels produced. Lastly, the team will evaluate the quality and accuracy of collected clinical crowdsourced data on two existing chart review problems to determine the platform's utility.
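To illustrate the kind of scrubbing the first Aim envisions, the sketch below replaces a few identifier patterns with placeholder tags. This is purely illustrative: the pattern list, tag names, and `scrub_note` function are assumptions for exposition, not the project's actual tooling, and a real de-identification tool must address all 18 HIPAA Safe Harbor identifier categories rather than the three shown here.

```python
import re

# Illustrative patterns only; real de-identification covers all 18
# HIPAA Safe Harbor identifier categories, not just these three.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),                    # Social Security numbers
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),             # slash-delimited dates
    (re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"), "[PHONE]"),  # US phone numbers
]

def scrub_note(text: str) -> str:
    """Replace each matched identifier with its placeholder tag."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text

note = "Pt seen 3/14/2019; callback 615-555-0142."
print(scrub_note(note))  # -> Pt seen [DATE]; callback [PHONE].
```

Substituting tags rather than deleting spans is one plausible design choice here: the note stays readable for crowd labelers while the identifying values themselves never leave the institution.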

Public Health Relevance

Traditionally, clinical prediction models rely on supervised machine learning algorithms to probabilistically predict clinical events from labeled medical records. When data sets are small, manual chart reviews performed by clinical staff are sufficient to label each outcome; however, as data sets have scaled up and researchers aim to study larger cohorts, manual approaches become intractable. The goal of this proposal is to develop a framework to crowdsource labeled data sets from electronic medical records to support prediction model development.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Exploratory/Developmental Cooperative Agreement Phase I (UH2)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Miller, David J
Vanderbilt University Medical Center
United States