We developed an information retrieval and extraction system that processes the full text of biological papers. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises approximately one hundred categories of terms, such as "gene", "regulation", "human disease" and "brain area", and also contains the main Gene Ontology (GO) categories. Extraction of particular biological facts, such as gene-gene interactions, and the curation of GO cellular components can be accelerated significantly by ontologies, with Textpresso performing nearly as well as expert curators at automatically identifying relevant sentences. We have established search engines for four literatures (C. elegans, Drosophila, Arabidopsis and neuroscience), and other groups around the world have developed nine systems for other literatures. The system will be further developed in many respects. In collaboration with the respective model organism databases, we will set up literature search engines for zebrafish, rat and Dictyostelium, and consider systems for important diseases such as cancer, Alzheimer's disease and AIDS. We will improve the quality of the searchable full text by preserving superscripts, subscripts and special-character information, and by recognizing the subsections of a paper. Website and system enhancements will include synonym searches, better website customization features ("myTextpresso"), browsing and searching of a paper taxonomy, batch queries, and notification of changes in search results caused by changes to the corpus. We will offer web services for Textpresso and maintain a public Subversion repository for the software. Named entity recognition algorithms will be implemented to find new terms for the ontology from full text. 
We will address the problem that highly specific terms in the lexica reduce recall, and enable searches for GO annotations. Strategies for (semi-)automated literature curation include installing a paper triage system and a first-pass curation step that identifies which relevant data types can be found where in a paper. Automated curation tasks include producing connections between a paper and a biological entity such as a gene. We will develop learning algorithms that discover new categories and lexica in text. We will improve our curation strategy of developing specialized curation categories that are used to retrieve specific data, and develop corresponding curator interfaces to automate the processing pipeline from full text to database. We will research and implement new, more semantically oriented ways of searching by combining latent semantic indexing with new similarity measures. Machine learning algorithms for classifying sentences and extracting information will be implemented using hidden Markov models. A new approach to finding categories and lexica using graph theory will be investigated.
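
The core idea in the abstract (split text into sentences, label terms by ontology category, then query the labeled sentences) can be illustrated with a minimal Python sketch. This is not the Textpresso implementation: the lexicon entries, the function names (`split_sentences`, `label_sentence`, `build_index`, `query`) and the naive sentence splitter are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (category -> terms). The real ontology has
# roughly one hundred categories; these entries are illustrative only.
LEXICON = {
    "gene": ["lin-12", "glp-1", "let-23"],
    "regulation": ["activates", "represses", "regulates"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    It will mis-split abbreviations such as 'C. elegans'; a real system
    needs a more careful tokenizer.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Count, per category, how many lexicon term occurrences the sentence has."""
    lowered = sentence.lower()
    counts = Counter()
    for category, terms in LEXICON.items():
        for term in terms:
            # Whole-term match: no word character or hyphen on either side.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            counts[category] += len(re.findall(pattern, lowered))
    return counts


def build_index(text):
    """Return a list of (sentence, category counts) pairs."""
    return [(s, label_sentence(s)) for s in split_sentences(text)]


def query(index, required):
    """Return sentences with at least `required[category]` hits per category."""
    return [
        sentence
        for sentence, counts in index
        if all(counts[cat] >= n for cat, n in required.items())
    ]


if __name__ == "__main__":
    text = (
        "lin-12 activates glp-1 in the gonad. "
        "The worms were grown at 20 degrees. "
        "let-23 is expressed in the vulva."
    )
    index = build_index(text)
    # Candidate gene-gene interaction sentences: two genes plus a regulation term.
    print(query(index, {"gene": 2, "regulation": 1}))
```

Requiring two "gene" hits and one "regulation" hit in the same sentence is one simple way to express a gene-gene interaction query over category labels; a curator would still review the retrieved sentences.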

Public Health Relevance

Narrative: Biomedical researchers need to read or skim many thousands of scientific articles each year, more than is humanly possible. This project will extend and improve an automatic system, Textpresso, that searches millions of sentences to find those likely to contain crucial information. Textpresso also extracts some types of information automatically, making it possible to build organized databases of important information.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG004090-06
Application #
8034342
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Bonazzi, Vivien
Project Start
2006-03-23
Project End
2013-08-31
Budget Start
2011-03-23
Budget End
2013-08-31
Support Year
6
Fiscal Year
2011
Total Cost
$332,732
Indirect Cost
Name
California Institute of Technology
Department
Type
Schools of Arts and Sciences
DUNS #
009584210
City
Pasadena
State
CA
Country
United States
Zip Code
91125
Van Auken, Kimberly; Schaeffer, Mary L; McQuilton, Peter et al. (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database (Oxford) 2014:
Van Auken, Kimberly; Fey, Petra; Berardini, Tanya Z et al. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford) 2012:bas040
Fang, Ruihua; Schindelman, Gary; Van Auken, Kimberly et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 13:16
Rangarajan, Arun; Schedl, Tim; Yook, Karen et al. (2011) Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics 12:175
Van Auken, Kimberly; Jaffery, Joshua; Chan, Juancarlos et al. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics 10:228
Muller, Hans-Michael; Rangarajan, Arun; Teal, Tracy K et al. (2008) Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 6:195-204
Chen, David; Muller, Hans-Michael; Sternberg, Paul W (2006) Automatic document classification of biological literature. BMC Bioinformatics 7:370