An information retrieval and extraction system that processes the full text of biological papers will be developed. A prototype system has been in operation at WormBase for over a year, used by C. elegans researchers as well as WormBase biological curators, and has recently been implemented for yeast at SGD. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises 37 categories of terms, such as "gene," "regulation," and "method." Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically identifying relevant sentences nearly as well as expert curators; in searches for two uniquely named genes and an interaction term, the ontology confers a threefold increase in search efficiency. This system will be further developed in three ways. First, the core system will be refined and altered to allow expansion to multiple domains of interest, e.g., model organisms and human disease. Simple modifications to the system and website functionality will be made, including support for synonyms, search phrases, and case sensitivity. The project team will maintain the Textpresso site (www.textpresso.org), which will include C. elegans and pilot systems, and a software package will be supported for installation of Textpresso at local sites, e.g., SGD and FlyBase. Second, the ontology will be structured somewhat more deeply and the lexica expanded with organism- and field-specific terms. Third, algorithms for information extraction will be implemented.
One approach will be the implementation of similarity measures that use categories (high-level nodes) of the Textpresso ontology to reduce the dimensionality of the associated vector spaces. A second approach will be the development of hidden Markov models to fill slots of a fact template based on the marked-up text. Extracted information will be presented to the user or expert curator.

Public Description: The quality and pace of research depend upon rapid access to published information. This project will provide researchers with a search engine that rapidly gives them the detailed, technical information they want by indexing the complete text of research articles.
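
The sentence labeling, category queries, and category-based similarity measure described in the abstract can be illustrated with a minimal Python sketch. It is not the Textpresso implementation: the lexicon entries, the example text, and the function names (`split_sentences`, `label`, `query`, `category_vector`, `cosine`) are all hypothetical, and the sentence splitter is deliberately naive. The point it tries to show is that once terms are mapped to a small set of ontology categories, a sentence can be represented by one count per category rather than one dimension per word.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicon (term -> category). The abstract mentions 37
# categories; the three used here are illustrative only.
LEXICON = {
    "lin-3": "gene",
    "let-23": "gene",
    "unc-54": "gene",
    "activates": "regulation",
    "represses": "regulation",
    "rnai": "method",
}
CATEGORIES = sorted(set(LEXICON.values()))  # ['gene', 'method', 'regulation']


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label(sentence):
    """Return (term, category) pairs for lexicon terms found in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]


def category_vector(labels):
    """Count distinct labeled terms per category, in CATEGORIES order."""
    counts = Counter(cat for _, cat in set(labels))
    return [counts[c] for c in CATEGORIES]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query(labeled, required):
    """Sentences with at least `required[c]` distinct terms in each category c."""
    hits = []
    for sentence, labels in labeled:
        counts = Counter(cat for _, cat in set(labels))
        if all(counts[c] >= n for c, n in required.items()):
            hits.append(sentence)
    return hits


text = ("lin-3 activates let-23 in the vulva. "
        "RNAi of unc-54 was performed. "
        "The animals were grown at 20 degrees.")
labeled = [(s, label(s)) for s in split_sentences(text)]

# Two distinct genes plus a regulation term, in the spirit of the
# gene-gene interaction search mentioned in the abstract.
hits = query(labeled, {"gene": 2, "regulation": 1})
vectors = [category_vector(labels) for _, labels in labeled]
similarity = cosine(vectors[0], vectors[1])
```

Here `hits` contains only the first sentence, and each sentence is reduced to a three-component vector regardless of vocabulary size. A real system would need multi-word phrase matching, synonym handling, and a proper sentence splitter, none of which this sketch attempts.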

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Good, Peter J
California Institute of Technology
Schools of Arts and Sciences
United States
Van Auken, Kimberly; Schaeffer, Mary L; McQuilton, Peter et al. (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database (Oxford) 2014:
Van Auken, Kimberly; Fey, Petra; Berardini, Tanya Z et al. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford) 2012:bas040
Fang, Ruihua; Schindelman, Gary; Van Auken, Kimberly et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 13:16
Rangarajan, Arun; Schedl, Tim; Yook, Karen et al. (2011) Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics 12:175
Van Auken, Kimberly; Jaffery, Joshua; Chan, Juancarlos et al. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics 10:228
Muller, Hans-Michael; Rangarajan, Arun; Teal, Tracy K et al. (2008) Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 6:195-204
Chen, David; Muller, Hans-Michael; Sternberg, Paul W (2006) Automatic document classification of biological literature. BMC Bioinformatics 7:370