This proposal describes a new tool for text data mining-a biomedical language ontology and integrated natural-language-processing methods. Our long-term goal is to provide resources for biomedical knowledge discovery from text. Our immediate goal is to provide a knowledge discovery tool for the curation of organism databases such as the Genome Database (SGD). The proposed research not only serves the research needs of the SGD community, it also helps the broader biomedical community exploit the strengths of the comparative approach to biological research. The hypothesis of this proposal is that knowledge discovery from biomedical text requires a knowledge base that integrates both genomic and linguistic information. This hypothesis is based on two observations: (a) the language of biomedicine, like all natural language, is complex in structure and morphology (the basic units of meaning) and poses problems of synonymy (several terms having the same meaning), polysemy (a term having more than one meaning), hypernymy (one term being more general than another), hyponymy (one term being more specific than another), denotation (what a term refers to in contrast to what it means), and denotation and description (different ways of referring to the same thing); and (b) important biomedical knowledge sources, such as the Gene Ontology (GO), are expressed in natural language.
The specific aims of the proposed project are to: 1. Extend an existing biomedical language ontology to include genomic and linguistic data from SGD; 2. Use this ontology to discover, in full-text articles made available by SGD, information about the molecular function of yeast gene products that can be inferred from direct experimental assays; 3. Evaluate the effectiveness of the new tool and methods by comparing its results to those of the SGD curators for gene products that have GO functional annotations with evidence code IDA (Inferred from Direct Assay).