Evidence-based medicine (EBM) aims to inform patient care with the totality of available relevant evidence. Systematic reviews are the cornerstone of EBM and are critical to modern healthcare, informing everything from national health policy to bedside decision-making. But conducting systematic reviews is extremely laborious (and hence expensive): producing a single review requires thousands of person-hours. Moreover, the exponential expansion of the biomedical literature base has imposed an unprecedented burden on reviewers, thus multiplying costs. Researchers can no longer keep up with the primary literature, and this hinders the practice of evidence-based care. The long-term aim of this work is to develop computational tools and methods that optimize the practice of EBM. The proposed work builds upon our previous successful efforts developing computational approaches that reduce the workload in EBM. More specifically, we aim to develop tools that semi-automate the laborious task of data extraction - identifying and extracting the information of interest (e.g., trial sample size, interventions and outcomes) from the free text of biomedical articles - via novel machine learning methods. Semi-automating this task will drastically reduce reviewer workload, thus enabling the practice of EBM in an age of information overload. Previous efforts to automate data extraction from articles describing clinical trials have shown promise, but lack the accuracy and scope necessary for real-world use. These approaches have been impeded by the absence of a large corpus of annotated clinical trials, and by the difficulty of constructing models to automatically extract all of the variables necessary for synthesis. We describe methodological innovations to overcome these hurdles.
First, to train our machine learning models we propose leveraging large existing databases that contain structured information about clinical trials, in lieu of the usual approach of collecting expensive manual annotations. Practically, this means we will be able to exploit a very large 'pseudo-annotated' dataset that is an order of magnitude bigger than what has been used in previous efforts, thus substantially improving model performance. Our extensive preliminary work demonstrates the promise and feasibility of this approach. Second, we propose novel machine learning models appropriate for the tasks of article categorization and data extraction for EBM. These models will specifically be designed to perform extraction of multiple, correlated data elements of interest while simultaneously classifying articles into clinically salient categories useful for EBM. We will rigorously evaluate the developed methods to assess their practical utility, specifically by comparing automated extraction accuracy to that achieved by trained systematic reviewers. And to make these methods useful to end-users (systematic reviewers), we will develop and evaluate open-source software and tools, including a web-based extraction tool that integrates our machine learning models to automatically extract information from uploaded articles (PDFs). We will conduct a user study to evaluate the utility and usability of this tool in practice.
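The idea of deriving 'pseudo-annotations' from an existing structured database, rather than from manual labelling, can be illustrated with a minimal sketch. This is not the project's actual method: the token-overlap heuristic, the function names (`tokenize`, `pseudo_annotate`), the `threshold` parameter, the stopword list, and the example record and sentences are all assumptions made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "in", "of", "or", "the", "to", "with", "was", "were"}


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def pseudo_annotate(sentences, record, threshold=0.5):
    """Attach noisy labels to sentences using a structured trial record.

    A sentence receives a field's label when at least `threshold` of that
    field's tokens also occur in the sentence. The resulting labels are
    'pseudo-annotations': cheap to produce, but not manually verified.
    """
    labelled = []
    for sentence in sentences:
        sentence_tokens = tokenize(sentence)
        labels = []
        for field, value in record.items():
            field_tokens = tokenize(value)
            if not field_tokens:
                continue  # nothing to match against for an empty field
            overlap = len(field_tokens & sentence_tokens) / len(field_tokens)
            if overlap >= threshold:
                labels.append(field)
        labelled.append((sentence, labels))
    return labelled


# Hypothetical database entry and article sentences, invented for the example.
record = {
    "population": "adults with type 2 diabetes",
    "intervention": "metformin",
    "outcome": "glycated haemoglobin",
}
sentences = [
    "We enrolled 120 adults with type 2 diabetes.",
    "Participants received metformin or placebo.",
    "The weather was mild throughout the study.",
]

labelled = pseudo_annotate(sentences, record)
for sentence, labels in labelled:
    print(labels, sentence)
```

Labels produced this way are noisy by construction, so a model trained on them would still need to be evaluated against manually extracted data, as the evaluation plan above describes.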

Public Health Relevance

We propose to develop computational methods and tools that make the practice of evidence-based medicine (EBM) more efficient, specifically by semi-automating data extraction from the full texts of articles describing clinical trials. Such tools would drastically reduce the workload currently involved in producing evidence syntheses, ultimately enabling evidence-based care in an era of information overload.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Northeastern University
Schools of Arts and Sciences
United States
Marshall, Iain J; Noel-Storr, Anna; Kuiper, Joël et al. (2018) Machine learning for identifying Randomized Controlled Trials: An evaluation and practitioner's guide. Res Synth Methods 9:602-614
Marshall, Iain J; Kuiper, Joël; Banner, Edward et al. (2017) Automating Biomedical Evidence Synthesis: RobotReviewer. Proc Conf Assoc Comput Linguist Meet 2017:7-12
Singh, Gaurav; Marshall, Iain J; Thomas, James et al. (2017) A Neural Candidate-Selector Architecture for Automatic Structured Clinical Text Annotation. Proc ACM Int Conf Inf Knowl Manag 2017:1519-1528
Zhang, Ye; Marshall, Iain; Wallace, Byron C (2016) Rationale-Augmented Convolutional Neural Networks for Text Classification. Proc Conf Empir Methods Nat Lang Process 2016:795-804
Wallace, Byron C; Kuiper, Joël; Sharma, Aakash et al. (2016) Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision. J Mach Learn Res 17:
Yu, Zhiguo; Bernstam, Elmer; Cohen, Trevor et al. (2016) Improving the utility of MeSH® terms using the TopicalMeSH representation. J Biomed Inform 61:77-86