There has been growing interest in recent years in developing methods that automatically identify Gene Ontology (GO) concepts in the unstructured text of scientific articles. This interest is motivated in part by the need to automate model-organism database curation. Beyond curation, methods that automatically identify GO concepts in text will enable data mining tools that compile and interpret information extracted from text, tools that will benefit a large number of people across the scientific enterprise. This project builds on recently completed work in which we used the literature of S. cerevisiae and annotations in the Saccharomyces Genome Database (SGD) to develop methods that determine what molecular function claims are made in an article and what experimental evidence the article provides for those claims. The data generated in that work contain a wealth of information that could lead to greatly improved methods for identifying GO concepts in text.
The specific aims of this project are: (1) to develop a representation for GO molecular function concepts that captures information not only about the language of a GO term but also about the biomedical entity the term refers to; and (2) to analyze the results of the S. cerevisiae data mining project using the GO representations formulated in (1) to determine which representations are likely to produce improved GO term recognition. The analysis will be performed on 276 true positive results, 29,276 false positive results, and 336 false negative results to determine whether a new GO concept representation can reduce the number of false positives or false negatives without losing any true positives. The data mining tools of this proposal can be extended to ontologies other than GO, thereby leveraging the effort expended on ontology development.
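
To put the result counts above in perspective, the following minimal sketch computes the precision and recall they imply, assuming the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name `precision_recall` is hypothetical and used only for illustration; the source does not state which evaluation measures the project reports.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) under the standard definitions.

    Hypothetical helper for illustration; not part of the project's tooling.
    """
    precision = tp / (tp + fp)  # fraction of reported results that are correct
    recall = tp / (tp + fn)     # fraction of correct results that were reported
    return precision, recall


# Counts stated in the text above.
TP, FP, FN = 276, 29_276, 336

precision, recall = precision_recall(TP, FP, FN)
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

Under these assumed definitions, the counts correspond to a precision of roughly 0.9% and a recall of roughly 45%, so the false positives dominate the error and are where a new representation has the most room to help.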