Scientific evidence is primarily disseminated in free-text journal articles. Drawing upon this evidence to make decisions or inform policies therefore requires perusing relevant articles and manually extracting the findings of interest. Unfortunately, this process is time-consuming and has not scaled to meet the demands imposed by the torrential expansion of the scientific evidence base. This work seeks to design novel Natural Language Processing (NLP) methods that can automatically "read" and make sense of unstructured published scientific evidence. This is critically important because decisions by policy-makers, caregivers and individuals should be informed by the entirety of the relevant published scientific evidence; but because evidence is predominantly unstructured -- and hence not directly actionable -- this is currently impossible in practice.

Consider clinical medicine, an important example that serves as the target domain of this proposal (although the framework and models will generalize to other scientific areas). Roughly 100 articles describing clinical trials were published every day in 2015. Healthcare professionals cannot possibly keep pace with this volume, and thus treatment decisions must be made without full consideration of the available evidence. Methods that can automatically infer from this mass of unstructured literature which treatments are actually supported by the evidence would facilitate better, evidence-based decisions.

Toward this end, this research seeks to design NLP models capable of mapping from natural-language scientific articles describing studies or trials to structured "evidence frames" that codify the interventions and outcomes studied, and the reported findings concerning these. NLP technology is not presently up to this task.
Therefore, this project will support core methodological contributions that will advance systems for data extraction and machine reading of lengthy articles; these will have impact beyond the present motivating application.

From a technical perspective, the focus of this work is developing novel, interpretable (transparent) neural network models for extraction from and inference over lengthy articles. Specifically, this project aims to design models that can automatically identify treatments and associated outcomes from free text, and then infer the reported comparative effects of the former with respect to the latter. This pushes against the limits of existing language technology capabilities. In particular, it necessitates models that perform deep analysis of individual, potentially lengthy, technical documents. Furthermore, model transparency is critical here, as domain experts must be able to recover from where in documents evidential claims were inferred. New corpora curated for this project (to be shared with the broader community) will facilitate core NLP research on such models. To realize the aforementioned methodological aims, the researchers leading this project will develop conditional and dynamic "attentive" neural models. Specific methodological lines of research to be explored include: (i) models equipped with conditional, sparse attention mechanisms over textual units that reflect scientific discourse structure, to achieve accurate and transparent extraction of, and inference concerning, reported evidence; and (ii) neural sequence tagging models that take multiple 'reads' of a text, exploiting iteratively adjusted conditional document representations as global context to inform local predictions. A project website (www.byronwallace.com/evidence-extraction) provides access to papers, datasets and other project outputs.
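The abstract does not commit to a particular sparse attention mechanism, but the transparency goal in line (i) can be illustrated with a sparsemax-style transformation: unlike softmax, it assigns exactly-zero weight to low-scoring textual units, so a domain expert can see precisely which sentences an inference was drawn from. The scores below are toy relevance scores over four hypothetical sentence representations; this is a minimal sketch, not the project's actual model.

```python
import numpy as np

def sparsemax(scores):
    """Sparse alternative to softmax: Euclidean projection of the score
    vector onto the probability simplex. Low-scoring entries receive
    exactly zero attention weight, yielding a transparent distribution."""
    z = np.sort(scores)[::-1]            # scores sorted in descending order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z)
    support = k * z > cumsum - 1         # entries that remain nonzero
    k_max = k[support][-1]               # size of the support
    tau = (cumsum[support][-1] - 1) / k_max  # threshold subtracted from scores
    return np.maximum(scores - tau, 0.0)

# Toy relevance scores for four sentences in a trial report (hypothetical).
weights = sparsemax(np.array([2.0, 1.5, 0.1, -1.0]))
# The last two sentences get exactly zero weight; weights sum to 1.
```

In a full model, the scores would come from a conditional scoring function (e.g., conditioned on a query intervention/outcome pair), and the nonzero-weight sentences would serve as the recoverable evidence for a prediction.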

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1750978
Program Officer: Hector Munoz-Avila
Project Start:
Project End:
Budget Start: 2018-07-01
Budget End: 2023-06-30
Support Year:
Fiscal Year: 2017
Total Cost: $565,933
Indirect Cost:
Name: Northeastern University
Department:
Type:
DUNS #:
City: Boston
State: MA
Country: United States
Zip Code: 02115