Scientists can no longer keep up with the ever-expanding number of scientific publications, and this inability is a fundamental bottleneck to scientific progress. Current search technologies are limited: they can find many relevant documents, but they cannot extract and organize the information in those documents or suggest new scientific hypotheses based on the organized content. Text mining strategies based on Natural Language Processing (NLP) are a recognized means of approaching this problem, but most scientists do not have the expertise or time to use them. In addition, the lack of interoperability among NLP tools and the scattering of data across repositories around the web are barriers to sharing workflows, resources, and results. This project will identify what analysis features are needed within an easy-to-use platform for mining scientific texts, implement an initial version of such a platform, and make it available to scientists.

There is currently no open, easy-to-use platform for mining scientific texts that provides interoperable access to a wide array of software, computing resources, and publication data. Publicly available software (such as Google) is not geared toward publication data, and in-house tools are fragile and deliver only a fraction of relevant results. The main objective of this project is, therefore, to (1) identify the requirements for an easy-to-use platform for mining information from scientific publications and (2) deploy facilities that meet these needs. To achieve this goal, the project will extend the existing NSF-funded LAPPS Grid to include the means to access a broad range of interoperable NLP tools, large bodies of publication data, and lexical and ontological resources, and, crucially, to rapidly adapt existing software to new domains and evaluate results. The project will also leverage enhancements to the NSF-funded Galaxy platform for interactive data exploration and extended access to NSF hardware resources (XSEDE machines including Stampede, Bridges, and Jetstream). By providing access to services for mining scientific publications and lowering the barriers to entry that result from licensing, redistribution, and intellectual property concerns, this project provides capabilities that were previously unavailable to scientists. Researchers are able to perform large-scale text mining on HPC infrastructure through a web-based interface without needing to know about the underlying infrastructure. Additionally, iterative domain adaptation capabilities enable scientists to easily adapt existing services to specialized areas without configuring or installing additional components. The ability to examine both explicit and implicit information scattered across massive repositories of publications will undoubtedly result in new observations and insights.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1811123
Program Officer: Stefan Robila
Budget Start: 2018-06-01
Budget End: 2019-12-31
Fiscal Year: 2018
Total Cost: $177,639

Name: Vassar College
City: Poughkeepsie
State: NY
Country: United States
Zip Code: 12604