Textpresso information retrieval and extraction system for biological literature

Muller, Hans-Michael

Abstract

We developed an information retrieval and extraction system that processes the full text of biological papers. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises approximately one hundred categories of terms, such as "gene", "regulation", "human disease", "brain area", etc., and also contains the main Gene Ontology (GO) categories. Ontologies can significantly accelerate the extraction of particular biological facts, such as gene-gene interactions, and the curation of GO cellular components; in identifying sentences, Textpresso automatically performs nearly as well as expert curators. We have established search engines for four literatures (C. elegans, Drosophila, Arabidopsis and neuroscience), and other groups around the world have developed nine systems for other literatures. We will develop the system further in several respects. In collaboration with the respective model organism databases, we will set up literature search engines for zebrafish, rat and Dictyostelium, and consider systems for important diseases such as cancer, Alzheimer's and AIDS. We will improve the quality of searchable full text by carrying over super- and subscripts as well as special character information, and by recognizing the subsections of a paper. Website and system enhancements will include synonym searches, better website customization features ("myTextpresso"), browsing and searching a paper taxonomy, batch queries, and notification of changes in search results caused by changes to the corpus. We will offer web services for Textpresso and maintain a public Subversion system for the software. Named entity recognition algorithms will be implemented to find new terms for the ontology from full text.
We will work on the problem of high specificity of terms in the lexica, which reduces recall, and enable searches for GO annotations. Strategies for (semi-)automated literature curation include installing a paper triage system and first-pass curation to identify where in a paper each relevant data type can be found. Automated curation tasks include producing connections between a paper and a biological entity such as a gene. We will develop learning algorithms that discover new categories and lexica in text. We will improve our curation strategy of developing specialized curation categories that are used to retrieve specific data, and develop corresponding curator interfaces to automate the processing pipeline from full text to database. We will research and implement new, more semantically oriented ways of searching by combining latent semantic indexing with new similarity measures. Machine learning algorithms for classifying sentences and extracting information will be implemented using hidden Markov models. A new approach to finding categories and lexica using graph theory will be investigated.
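
The pipeline described in the abstract (split text into sentences, label terms by ontology category, query the labeled sentences) can be illustrated with a minimal sketch. This is not Textpresso's implementation: the two-category ontology, the sample text, and the function names split_sentences, label_sentence, build_index and query are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical mini-ontology: category -> lexicon of terms (illustrative only,
# not taken from the actual Textpresso ontology).
ONTOLOGY = {
    "gene": {"lin-12", "glp-1", "daf-16"},
    "regulation": {"regulates", "represses", "activates"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Return {category: set of matched lexicon terms} for one sentence."""
    tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
    labels = {}
    for category, lexicon in ONTOLOGY.items():
        hits = tokens & lexicon
        if hits:
            labels[category] = hits
    return labels


def build_index(text):
    """Build the 'database of labeled sentences' as (sentence, labels) pairs."""
    return [(s, label_sentence(s)) for s in split_sentences(text)]


def query(index, required):
    """Return sentences with at least `n` distinct terms of each category.

    `required` maps a category name to a minimum term count, e.g.
    {"gene": 2, "regulation": 1} as a crude gene-gene interaction query.
    """
    return [
        sentence
        for sentence, labels in index
        if all(len(labels.get(cat, ())) >= n for cat, n in required.items())
    ]


# Made-up sample text for illustration only.
SAMPLE = (
    "lin-12 activates glp-1 in the gonad. "
    "The worms were grown at 20 degrees. "
    "daf-16 was not examined here."
)
index = build_index(SAMPLE)
matches = query(index, {"gene": 2, "regulation": 1})
```

Requiring two gene terms plus one regulation term in the same sentence is one simple way to approximate a gene-gene interaction search over category labels. A real system would need a splitter that handles abbreviations such as "C. elegans", multi-word phrases, and a far larger lexicon.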

Public Health Relevance

Biomedical researchers need to read or skim many thousands of scientific articles each year, more than is humanly possible. This project will extend and improve an automatic system, Textpresso, that searches millions of sentences to find those likely to contain crucial information. Textpresso also extracts some types of information automatically, making it possible to build organized databases of important information.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG004090-05
Application #: 7772342
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Good, Peter J

Project Start: 2006-03-23
Project End: 2012-01-31
Budget Start: 2010-02-01
Budget End: 2011-01-31
Support Year: 5
Fiscal Year: 2010
Total Cost: $326,303
Indirect Cost

Institution

Name: California Institute of Technology
Department
Type: Schools of Arts and Sciences
DUNS #: 009584210

City: Pasadena
State: CA
Country: United States
Zip Code: 91125

Related projects


NIH 2012 R01 HG	Textpresso information retrieval and extraction system for biological literature Muller, Hans-Michael / California Institute of Technology	$290,837
NIH 2011 R01 HG	Textpresso information retrieval and extraction system for biological literature Muller, Hans-Michael / California Institute of Technology	$332,732
NIH 2010 R01 HG	Textpresso information retrieval and extraction system for biological literature Muller, Hans-Michael / California Institute of Technology	$326,303
NIH 2009 R01 HG	Textpresso: information retrieval and extraction system for biological literature Muller, Hans-Michael / California Institute of Technology	$320,000
NIH 2008 R01 HG	Textpresso, information retrieval and extraction system for biological literature Sternberg, Paul Warren / California Institute of Technology	$285,766
NIH 2007 R01 HG	Textpresso, information retrieval and extraction system for biological literature Sternberg, Paul Warren / California Institute of Technology	$291,301
NIH 2006 R01 HG	Textpresso, an information retrieval and extraction system for biological literat Sternberg, Paul Warren / California Institute of Technology	$300,000

Publications

Van Auken, Kimberly; Schaeffer, Mary L; McQuilton, Peter et al. (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database (Oxford) 2014:

Van Auken, Kimberly; Fey, Petra; Berardini, Tanya Z et al. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford) 2012:bas040

Fang, Ruihua; Schindelman, Gary; Van Auken, Kimberly et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 13:16

Rangarajan, Arun; Schedl, Tim; Yook, Karen et al. (2011) Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics 12:175

Van Auken, Kimberly; Jaffery, Joshua; Chan, Juancarlos et al. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics 10:228

Muller, Hans-Michael; Rangarajan, Arun; Teal, Tracy K et al. (2008) Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 6:195-204

Chen, David; Muller, Hans-Michael; Sternberg, Paul W (2006) Automatic document classification of biological literature. BMC Bioinformatics 7:370
