An information retrieval and extraction system that processes the full text of biological papers will be developed. A prototype system has been in operation at WormBase for over a year, used by C. elegans researchers as well as WormBase biological curators, and has recently been implemented for yeast at SGD. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises 37 categories of terms, such as 'gene,' 'regulation,' and 'method.' Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies: Textpresso performs nearly as well as expert curators at automatically identifying such sentences, and in searches for two uniquely named genes and an interaction term, the ontology confers a threefold increase in search efficiency. This system will be further developed in three ways. First, the core system will be refined and altered to allow expansion to multiple domains of interest, e.g., model organisms and human disease. Simple modifications to the system and website functionality will be made, including support for synonyms, search phrases, and case sensitivity. A software package for local installation will be supported. The project team will maintain the Textpresso site (www.textpresso.org), which will include C. elegans and pilot systems, and a software package will be available for installing Textpresso at local sites, e.g., SGD and FlyBase. Second, the ontology will be structured somewhat more deeply, and the lexica will be expanded with organism- and field-specific terms. Third, algorithms for information extraction will be implemented. One approach will be the implementation of similarity measures using categories (high-level nodes) of the Textpresso ontology to reduce the dimensionality of associated vector spaces.
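The category-based similarity idea can be illustrated with a minimal sketch. It assumes a toy lexicon of seven terms in three categories (the real ontology has 37 categories) and uses cosine similarity, which is one possible choice of measure; the names `LEXICON`, `category_vector`, and `cosine` are hypothetical and not part of Textpresso. Each sentence is represented by one count per category rather than one count per word, so the vector has as many dimensions as there are categories.

```python
import math
import re
from collections import Counter

# Hypothetical mini-lexicon mapping terms to ontology categories.
# Illustrative only; the actual Textpresso lexica are far larger.
LEXICON = {
    "lin-12": "gene",
    "glp-1": "gene",
    "let-23": "gene",
    "represses": "regulation",
    "activates": "regulation",
    "rnai": "method",
    "microscopy": "method",
}

# Fixed category order defines the axes of the reduced vector space.
CATEGORIES = sorted(set(LEXICON.values()))  # gene, method, regulation


def category_vector(sentence):
    """Count lexicon hits per category instead of per word."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    counts = Counter(LEXICON[t] for t in tokens if t in LEXICON)
    return [counts[c] for c in CATEGORIES]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    s1 = "lin-12 represses glp-1."
    s2 = "let-23 activates lin-12."
    s3 = "RNAi and microscopy were used."
    print(category_vector(s1), category_vector(s2), category_vector(s3))
    print(cosine(category_vector(s1), category_vector(s2)))
    print(cosine(category_vector(s1), category_vector(s3)))
```

In this sketch the first two sentences share no regulation term and only one gene name, yet they map to the same category vector, which is the intended effect of comparing at the category level.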
A second approach will be the development of hidden Markov models to fill slots of a fact template based on the marked-up text. Extracted information will be presented to the user or an expert curator.

Public Description: The quality and pace of research depend upon rapid access to published information. This project will provide researchers with a search engine that rapidly delivers the detailed technical information they want by indexing the complete text of research articles.
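The sentence-level workflow described in the abstract (split text into sentences, label them by ontology category, then query the labeled sentences) can be sketched as follows. This is an illustration under stated assumptions, not the Textpresso implementation: the lexicon is a toy, the sentence splitter is naive, and the function names `split_sentences`, `label_sentence`, `build_index`, and `query` are hypothetical.

```python
import re

# Hypothetical lexicon: category -> set of terms (illustrative only).
LEXICON = {
    "gene": {"lin-12", "glp-1"},
    "regulation": {"represses", "activates"},
    "method": {"rnai"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    This would wrongly split abbreviations such as 'C. elegans';
    a production system needs a more careful splitter.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Return the set of categories that have at least one term in the sentence."""
    tokens = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
    return {cat for cat, terms in LEXICON.items() if tokens & terms}


def build_index(text):
    """Pair every sentence with its category labels."""
    return [(s, label_sentence(s)) for s in split_sentences(text)]


def query(index, required_categories):
    """Return sentences labeled with all of the required categories."""
    required = set(required_categories)
    return [s for s, cats in index if required <= cats]


if __name__ == "__main__":
    text = (
        "lin-12 represses glp-1 in the germline. "
        "RNAi was used to test this. "
        "The weather was fine."
    )
    index = build_index(text)
    print(query(index, ["gene", "regulation"]))
    print(query(index, ["method"]))
```

Querying by category rather than by keyword is what lets a search for a gene together with a regulation term retrieve a sentence without the user listing every possible verb.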
Van Auken, Kimberly; Schaeffer, Mary L; McQuilton, Peter et al. (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database (Oxford) 2014:
Van Auken, Kimberly; Fey, Petra; Berardini, Tanya Z et al. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database (Oxford) 2012:bas040
Fang, Ruihua; Schindelman, Gary; Van Auken, Kimberly et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 13:16
Rangarajan, Arun; Schedl, Tim; Yook, Karen et al. (2011) Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics 12:175
Van Auken, Kimberly; Jaffery, Joshua; Chan, Juancarlos et al. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics 10:228
Muller, Hans-Michael; Rangarajan, Arun; Teal, Tracy K et al. (2008) Textpresso for neuroscience: searching the full text of thousands of neuroscience research papers. Neuroinformatics 6:195-204
Chen, David; Muller, Hans-Michael; Sternberg, Paul W (2006) Automatic document classification of biological literature. BMC Bioinformatics 7:370