The Text Mining Pipeline to Accelerate Systematic Reviews in Evidence-Based Medicine project will combine research from several areas of biomedical text mining that are needed to improve how systematic reviews are conducted, by means of a text-mining-enhanced workflow. Our consortium will pursue three specific aims to support this work:
Aim 1. Study how to create a metasearch engine and database that collects information from important systematic review sources, indexes this information consistently, and provides a robust information retrieval system with high recall and precision for accessing this expanded literature collection.
Aim 2. Study how to create a literature classification and ranking system that is customizable and trainable for each user, systematic review group, and systematic review topic. This supervised-learning-based system takes as input the list of articles retrieved for a given query and outputs them grouped by article type and ordered by predicted probability of relevance to an individual writing a systematic review on the given topic.
Aim 3. Study how to create a study aggregator that groups articles referring to the same underlying clinical trial. This will save reviewers work and time, because they will have automated assistance in determining whether two articles are independent data sources or derive their evidence from the same primary data.
Taken together, these results will inform construction of a text mining pipeline system that will decrease the manual burden on systematic reviewers during the literature collection and review process, and increase the proportion of reviewer time spent synthesizing evidence and performing meta-analyses. The system will lead to a real difference in the rate at which high-quality evidence reports can be compiled. Ultimately, the coverage, dissemination, and acceptance of evidence-based medicine in the biomedical community will increase, resulting in better and more cost-effective clinical care.
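As an illustration of the output format described in Aim 2, the following is a minimal sketch, not the project's implementation. It assumes each retrieved article already carries a predicted article type and a predicted probability of relevance (in practice these would come from a trained classifier). The names ScoredArticle and group_and_rank, the identifiers, and the scores are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredArticle:
    pmid: str            # hypothetical article identifier
    article_type: str    # predicted type label, e.g. "RCT" (assumed to come from a classifier)
    relevance: float     # predicted probability of relevance, in [0, 1] (made-up values below)


def group_and_rank(articles):
    """Group articles by predicted type, then sort each group by descending relevance."""
    groups = defaultdict(list)
    for article in articles:
        groups[article.article_type].append(article)
    # Types are listed in sorted label order; within each type, most relevant first.
    return {
        article_type: sorted(members, key=lambda a: a.relevance, reverse=True)
        for article_type, members in sorted(groups.items())
    }


# Hypothetical retrieval results for a single query.
retrieved = [
    ScoredArticle("A1", "RCT", 0.42),
    ScoredArticle("A2", "cohort study", 0.77),
    ScoredArticle("A3", "RCT", 0.91),
    ScoredArticle("A4", "case report", 0.15),
]

ranked = group_and_rank(retrieved)
for article_type, members in ranked.items():
    print(article_type, [(a.pmid, a.relevance) for a in members])
```

Keeping the grouping separate from the scoring means the same presentation step could be reused if the underlying classifier were retrained for a different user, review group, or topic.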

Public Health Relevance

This project will improve the process of summarizing the best available medical evidence for a wide range of medical conditions. These summaries are utilized by both medical practitioners and policy makers as an essential component of providing higher quality, more cost-effective medical care for everyone.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane
University of Illinois at Chicago
Schools of Medicine
United States
Smalheiser, Neil R (2017) Rediscovering Don Swanson: the Past, Present and Future of Literature-Based Discovery. J Data Inf Sci 2:43-64
Peng, Yufang; Bonifield, Gary; Smalheiser, Neil R (2017) Gaps within the Biomedical Literature: Initial Characterization and Assessment of Strategies for Discovery. Front Res Metr Anal 2:
Wallace, Byron C; Noel-Storr, Anna; Marshall, Iain J et al. (2017) Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. J Am Med Inform Assoc 24:1165-1168
Smalheiser, Neil R; Bonifield, Gary (2016) Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. J Biomed Discov Collab 7:e1
Smalheiser, Neil R; Shao, Weixiang; Yu, Philip S (2015) Nuggets: findings shared in multiple clinical case reports. J Med Libr Assoc 103:171-6
Cohen, Aaron M; Smalheiser, Neil R; McDonagh, Marian S et al. (2015) Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. J Am Med Inform Assoc 22:707-17
Shao, Weixiang; Adams, Clive E; Cohen, Aaron M et al. (2015) Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial. Methods 74:65-70
D'Souza, Jennifer L; Smalheiser, Neil R (2014) Three journal similarity metrics and their application to biomedical journals. PLoS One 9:e115681
Jiang, Yu; Lin, Can; Meng, Weiyi et al. (2014) Rule-based deduplication of article records from bibliographic databases. Database (Oxford) 2014:bat086
Edinger, Tracy; Cohen, Aaron M (2013) A large-scale analysis of the reasons given for excluding articles that are retrieved by literature search during systematic review. AMIA Annu Symp Proc 2013:379-87

Showing the most recent 10 out of 11 publications