The goal of this project is to provide automatic word sense disambiguation systems based on a principled English sense inventory geared to information processing needs. Polysemy, the existence of multiple possible interpretations, or senses, of a word, is one of the major bottlenecks to accurate and focused information processing. Past attempts to use a public-domain resource, WordNet, as a sense inventory for creating training data have not been successful, because its vague and subtle sense distinctions lead to poor inter-annotator agreement. We are experimenting with approaches for grouping fine-grained WordNet senses into coarser-grained sense distinctions that can be annotated more rapidly and more accurately. Using linguistic evidence, we are refining our methodology for grouping word senses and our annotation process while creating large amounts of sense-tagged text. The new sense inventory is linked to WordNet, FrameNet, and VerbNet, and each sense distinction is associated with clear criteria that facilitate accurate human sense tagging. Using the annotated data, we are developing accurate supervised and semi-supervised automatic word sense disambiguation systems, experimenting with different machine learning algorithms and feature sets.
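To make the supervised setting concrete, the Python sketch below trains a classifier that assigns a coarse-grained sense to the noun "bank" from simple bag-of-words context features. It is a minimal illustration only, not the project's actual system: the toy training sentences, the sense labels bank.financial and bank.river, and the choice of scikit-learn are all assumptions made for the sake of the example.

    # Minimal supervised WSD sketch: bag-of-words context features plus a
    # linear classifier. Illustrative only; the project's real feature sets,
    # sense groupings, and learning algorithms are not shown here.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each training instance pairs a context sentence containing "bank"
    # with a coarse-grained sense label (invented for this example).
    train_contexts = [
        "she deposited the check at the bank on Friday",
        "the bank raised its interest rates again",
        "they had a picnic on the bank of the river",
        "fish gathered near the muddy bank downstream",
    ]
    train_senses = ["bank.financial", "bank.financial", "bank.river", "bank.river"]

    # Bag-of-words over the surrounding context is one of the simplest
    # feature sets; richer features (collocations, part-of-speech tags,
    # syntactic relations) are typically layered on top.
    model = make_pipeline(CountVectorizer(), LogisticRegression())
    model.fit(train_contexts, train_senses)

    # Predict the coarse-grained sense for an unseen context sentence.
    print(model.predict(["swimmers rested on the river bank"]))

With realistic amounts of sense-tagged text, the same pipeline shape applies; the experimentation described above amounts to varying the feature extraction and the learning algorithm within this kind of framework.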
The sense inventory, the tagged data, and the trained systems will all be placed in the public domain for both national and international access, providing a stable English sense inventory geared to computational applications. The availability of broad-coverage automatic word sense disambiguation systems will provide a major boost in performance to information retrieval, information extraction, question answering, and machine translation, improving our ability to stay abreast of the information avalanche.