The University of Southern California is awarded a grant for the construction of SciKnowMine, a shared computational framework that scales up the processing of large datasets across different communities through the automated mining of text, images, and other amenable media at the scale of the entire literature (www.sciknowmine.org/). This system will be tailored specifically to support the actions of bio-curators through a generic set of web services that may be specialized within specific curation workflows from different databases. This project is a multi-community collaboration requiring contributions from computer scientists, bioinformatics specialists, and bio-curators. SciKnowMine will be a prototype to process one million documents for a well-defined bio-curation task: document triage. This involves determining automatically whether a given article is of interest to a bio-curator in the context of a specific bioinformatics database. The product will be an open informatics infrastructure for text mining that (a) is available to the computer science NLP community, (b) serves the immediate needs of bio-curators, and (c) scales to accommodate millions of documents. Building it requires a deep understanding of biology, knowledge engineering, NLP, and high-performance computing.
Academic publishing is undergoing a radical transformation. This project, if successful, will serve as an exemplar of the kinds of cyber-infrastructure that might be applied in other scientific subjects: from Physics to Engineering to, eventually, Political Science and certain parts of Sociology and Anthropology. Papers and data are increasingly made publicly available on the web (by authors, by open-access publishers, or by governmental decree), exponentially increasing the quantity of text that scientists must read to stay current in their subject and calling into question the value added by traditional commercial publishers. This project provides an opportunity to develop this infrastructure within an open-source environment that directly leverages the work of cutting-edge computer scientists. How exactly the eventual cyber-infrastructure solution is shaped, which aspects are freely available to all, and which require a commercial presence are questions of particular interest in this work.