Textpresso, an information retrieval and extraction system for biological literat

Sternberg, Paul

Abstract

An information retrieval and extraction system that processes the full text of biological papers will be developed. A prototype system has been in operation at WormBase for over a year, used by C. elegans researchers as well as WormBase biological curators, and has recently been implemented for yeast at SGD. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises 37 categories of terms, such as "gene," "regulation," "method," etc. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators at identifying sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a threefold increase in search efficiency.

This system will be further developed in three ways. First, the core system will be refined and altered to allow expansion to multiple domains of interest, e.g., model organisms and human disease. Simple modifications to the system and website functionality will be made, including synonyms, search phrases, and case sensitivity. A software package for local installation will be supported. The project team will maintain the Textpresso site (www.textpresso.org), which will include C. elegans and pilot systems, but the software package will be available for installation of Textpresso at local sites, e.g., SGD, FlyBase, etc. Second, the ontology will be structured somewhat more deeply and lexica expanded for organism- and field-specific terms. Third, algorithms for information extraction will be implemented. One approach will be the implementation of similarity measures using categories (high-level nodes) of the Textpresso ontology to reduce the dimensionality of associated vector spaces. A second approach will be the development of hidden Markov models to fill slots of a fact template based on the marked-up text. Information extracted will be presented to the user or expert curator.

Public Description: The quality and pace of research depend upon rapid access to published information. This project will provide researchers with a search engine that rapidly gives them the detailed, technical information they want by indexing the complete text of research articles.
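
The sentence labeling, category query, and category-based similarity described in the abstract can be illustrated with a minimal Python sketch. This is not Textpresso's code: the five-term lexicon, the naive sentence splitter, and the function names (`split_sentences`, `label_sentence`, `search`, `category_vector`, `cosine`) are all assumptions made for illustration, and the real ontology comprises 37 categories with far larger lexica.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon (term -> category); not the actual Textpresso ontology.
LEXICON = {
    "lin-12": "gene",
    "glp-1": "gene",
    "activates": "regulation",
    "represses": "regulation",
    "rnai": "method",
}

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def label_sentence(sentence):
    """Return (term, category) pairs for every token found in the lexicon."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

def category_vector(sentence):
    """Count labels per category: one dimension per category, not per word."""
    return Counter(cat for _, cat in label_sentence(sentence))

def search(text, required):
    """Return sentences containing at least `required[c]` terms of each category c."""
    return [
        s for s in split_sentences(text)
        if all(category_vector(s)[c] >= n for c, n in required.items())
    ]

def cosine(a, b):
    """Cosine similarity between two category-count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

TEXT = (
    "lin-12 activates glp-1 in the gonad. "
    "RNAi was used to test lin-12. "
    "The worms were grown at 20 degrees."
)

# Query: two gene terms plus one regulation term in the same sentence.
hits = search(TEXT, {"gene": 2, "regulation": 1})
```

Here `search` mirrors the kind of query mentioned in the abstract (two genes and an interaction term in one sentence), and `category_vector` shows why categories reduce dimensionality: sentences are compared over a handful of category counts rather than over every distinct word.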

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG004090-01
Application #: 7047977
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Good, Peter J

Project Start: 2006-03-23
Project End: 2009-01-31
Budget Start: 2006-03-23
Budget End: 2007-01-31
Support Year: 1
Fiscal Year: 2006
Total Cost: $300,000
Indirect Cost

Institution

Name: California Institute of Technology
Department
Type: Schools of Arts and Sciences
DUNS #: 009584210

City: Pasadena
State: CA
Country: United States
Zip Code: 91125

Related projects


NIH 2012 R01 HG	Textpresso information retrieval and extraction system for biological literature	Muller, Hans-Michael / California Institute of Technology	$290,837
NIH 2011 R01 HG	Textpresso information retrieval and extraction system for biological literature	Muller, Hans-Michael / California Institute of Technology	$332,732
NIH 2010 R01 HG	Textpresso information retrieval and extraction system for biological literature	Muller, Hans-Michael / California Institute of Technology	$326,303
NIH 2009 R01 HG	Textpresso: information retrieval and extraction system for biological literature	Muller, Hans-Michael / California Institute of Technology	$320,000
NIH 2008 R01 HG	Textpresso, information retrieval and extraction system for biological literature	Sternberg, Paul Warren / California Institute of Technology	$285,766
NIH 2007 R01 HG	Textpresso, information retrieval and extraction system for biological literature	Sternberg, Paul Warren / California Institute of Technology	$291,301
NIH 2006 R01 HG	Textpresso, an information retrieval and extraction system for biological literat	Sternberg, Paul Warren / California Institute of Technology	$300,000

Publications

Van Auken, Kimberly; Schaeffer, Mary L; McQuilton, Peter et al. (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database (Oxford) 2014:

Van Auken, Kimberly; Fey, Petra; Berardini, Tanya Z et al. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford) 2012:bas040

Fang, Ruihua; Schindelman, Gary; Van Auken, Kimberly et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 13:16

Rangarajan, Arun; Schedl, Tim; Yook, Karen et al. (2011) Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics 12:175

Van Auken, Kimberly; Jaffery, Joshua; Chan, Juancarlos et al. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics 10:228

Muller, Hans-Michael; Rangarajan, Arun; Teal, Tracy K et al. (2008) Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 6:195-204

Chen, David; Muller, Hans-Michael; Sternberg, Paul W (2006) Automatic document classification of biological literature. BMC Bioinformatics 7:370
