Supervised machine learning is a widely used method that learns from labeled training examples to predict future outcomes. Unfortunately, supervised machine learning for biomedical research is often limited by a lack of labeled data. Current methods for producing labeled data rely on manual chart reviews that are laborious and do not scale with data creation rates. This project aims to develop a framework to crowdsource labeled data sets from electronic medical records by forming a crowd of clinical personnel labelers. The resulting labeled data sets will enable new biomedical research studies that were previously infeasible to conduct.

Developing a crowdsourcing platform for clinical data poses numerous practical and theoretical challenges. First, popular public crowdsourcing platforms such as Amazon's Mechanical Turk are not suitable for medical record labeling, as HIPAA makes clinical data sharing risky. Second, the types of clinical questions that are amenable to crowdsourcing are not well understood. Third, it is unclear whether the clinical crowd can produce labels quickly and accurately. Each of these challenges will be addressed in a separate Aim.

In the first Aim, the team will evaluate different clinical crowdsourcing architectures. The architecture must leverage the scale of the crowd while minimizing patient information exposure. De-identification tools will be considered to scrub clinical notes to reduce information leakage. Using this design, the team will extend a popular open-source crowdsourcing tool, Pybossa, and release it to the public. In the second Aim, the team will study the type, structure, topic, and specificity of clinical prediction questions, and how these characteristics affect the quality of the labels produced. Lastly, the team will evaluate the quality and accuracy of collected clinical crowdsourced data on two existing chart review problems to determine the platform's utility.
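To illustrate the kind of scrubbing the first Aim envisions, the sketch below replaces a few identifier patterns with placeholder tags. This is purely illustrative: the pattern list, tag names, and `scrub_note` function are assumptions for exposition, not the project's actual tooling, and a real de-identification tool must address all 18 HIPAA Safe Harbor identifier categories rather than the three shown here.

```python
import re

# Illustrative patterns only; real de-identification covers all 18
# HIPAA Safe Harbor identifier categories, not just these three.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),                    # Social Security numbers
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),             # slash-delimited dates
    (re.compile(r"\b\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"), "[PHONE]"),  # US phone numbers
]

def scrub_note(text: str) -> str:
    """Replace each matched identifier with its placeholder tag."""
    for pattern, tag in PHI_PATTERNS:
        text = pattern.sub(tag, text)
    return text

note = "Pt seen 3/14/2019; callback 615-555-0142."
print(scrub_note(note))  # -> Pt seen [DATE]; callback [PHONE].
```

Substituting tags rather than deleting spans is one plausible design choice here: the note stays readable for crowd labelers while the identifying values themselves never leave the institution.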

Public Health Relevance

Traditionally, clinical prediction models rely on supervised machine learning algorithms to probabilistically predict clinical events from labeled medical records. When data sets are small, manual chart reviews performed by clinical staff are sufficient to label each outcome; however, as data sets have scaled up and researchers aim to study larger cohorts, manual approaches become intractable. The goal of this proposal is to develop a framework to crowdsource labeled data sets from electronic medical records to support prediction model development.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Exploratory/Developmental Cooperative Agreement Phase I (UH2)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Miller, David J
Vanderbilt University Medical Center
United States