The proposed Dynamic Mining and Contextualization of the Scientific Literature (DMCSL) opens a channel of communication between authors, science journals, readers, and databases. The outcome of this communication portal will be a database of mineable metadata for researchers, reagent suppliers, and biotech companies. Data will be available to companies through individualized subscription models. The pipeline identifies biological entities, such as genes and alleles, and embeds hyperlinks from these entities to NHGRI-funded curated Model Organism Databases (MODs). DMCSL is an enhancement of a markup pipeline that has been in operation since 2009 and has linked biological entities in over 850 research articles in GENETICS and G3, published by the Genetics Society of America (GSA), to pages in three MODs: WormBase, FlyBase, and the Saccharomyces Genome Database. This proposal seeks funding to expand the scope of the GSA markup pipeline in all aspects: the biological entities linked; the authoritative databases linked to (Rat Genome Database; Mouse Genome Informatics; Zebrafish Model Organism Database; and the fission yeast genome database); and the journals linked from. The expansion will also include collecting information on the supplies and equipment described in the Materials and Methods sections of articles, along with supplier information. The DMCSL will collect and store link information along with author and journal metadata and link access statistics. By doing so, the DMCSL will provide valuable metrics to all stakeholders, including biotech companies and life science vendors, as well as a comprehensive and queryable view of biology that is not currently available. In Phase I, we will develop code flexible enough to scale the pipeline so that a single article can be linked to more lexica and more databases, within the strict turnaround time set by the publisher's production process.
We will also test the software on publications from other journals and develop tools to query and mine the relationships identified through the data extraction process. We will develop basic APIs: a core API that serves as the database resource, and a linking API that stores created links and monitors link activity. We will also use modern API management to build a portal for key-based access to other API data. Having demonstrated the stability and flexibility of the software under current parameters, in Phase II we will collaborate with a wider range of stakeholders (more journals, more databases, including human biomedical databases, and more companies) to develop experience-based APIs for each stakeholder group. These APIs will be designed around how each group interacts with the basic APIs developed in Phase I, and will be used to develop subscription-based access for commercial companies; access for academic stakeholders and collaborating journals will remain free.
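
The markup step described above (recognize entities from a lexicon, embed hyperlinks to the matching database pages, and keep a record of each link created) can be illustrated with a minimal Python sketch. This is not the GSA pipeline's actual code. The lexicon contents, the `example.org` URLs, and the names `LinkRecord`, `LEXICON`, and `link_entities` are all assumptions made for illustration.

```python
import html
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkRecord:
    entity: str    # entity text as it appeared in the article
    database: str  # name of the database linked to
    url: str       # target page (placeholder URLs in this sketch)
    offset: int    # character offset of the match in the source text


# Hypothetical lexicon: entity name -> (database, URL). The example.org
# URLs are placeholders, not real database addresses.
LEXICON = {
    "unc-22": ("WormBase", "https://example.org/wormbase/unc-22"),
    "dpp": ("FlyBase", "https://example.org/flybase/dpp"),
    "CDC28": ("SGD", "https://example.org/sgd/CDC28"),
}


def link_entities(text, lexicon):
    """Return (marked_up_html, link_records) for one passage of article text."""
    if not lexicon:
        return html.escape(text), []
    # Try longer names first so a long entity wins over a shorter one it contains.
    names = sorted(lexicon, key=len, reverse=True)
    # The lookarounds reject matches embedded in longer words or identifiers.
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    records, parts, last = [], [], 0
    for match in pattern.finditer(text):
        name = match.group(1)
        database, url = lexicon[name]
        records.append(LinkRecord(name, database, url, match.start()))
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f'<a href="{html.escape(url)}">{html.escape(name)}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts), records
```

Calling `link_entities("Mutations in unc-22 and CDC28 were examined.", LEXICON)` returns the passage with both gene names wrapped in anchors, plus two `LinkRecord` entries that a linking API of the kind described could store. A production pipeline would also need to handle ambiguous names, synonyms, and species context, which this exact-match sketch does not attempt.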

Public Health Relevance

The Dynamic Mining and Contextualization of the Scientific Literature (DMCSL) accelerates scientific discovery and improves reproducibility by creating interactive science articles and collecting data metrics valuable to all stakeholders: researchers, journals, databases, biotech research companies, and life science vendors. The DMCSL creates a communication bridge between authors and authoritative databases, allowing databases to enforce the use of standardized nomenclature and thereby promoting scientific provenance and reproducibility.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Small Business Innovation Research Grants (SBIR) - Phase I (R43)
Project #: 1R43HG009631-01
Application #: 9345927
Study Section: Special Emphasis Panel (ZRG1-RPHB-C (11)B)
Program Officer: Sofia, Heidi J
Project Start: 2017-04-01
Project End: 2018-03-31
Budget Start: 2017-04-01
Budget End: 2018-03-31
Support Year: 1
Fiscal Year: 2017
Total Cost: $211,668
Indirect Cost:

Name: Insilico, Inc.
Department:
Type: Domestic for-Profits
DUNS #: 034449576
City: Eugene
State: OR
Country: United States
Zip Code: 97405