This project is devoted to building a large multilingual semantic network through the application of novel techniques for semantic analysis specifically targeted at the Wikipedia corpus. The driving hypothesis of the project is that the structure of Wikipedia can be effectively used to create a highly structured graph of world knowledge in which nodes correspond to entities and concepts described in Wikipedia, while edges capture ontological relations such as hypernymy and meronymy. Special emphasis is given to exploiting the multilingual information available in Wikipedia in order to improve the performance of each semantic analysis tool. Significant research effort is therefore aimed at developing tools for word sense disambiguation, reference resolution, and the extraction of ontological relations that use multilingual reinforcement and the consistent structure and focused content of Wikipedia to solve these tasks accurately. An additional research challenge is the effective integration of inherently noisy evidence from multiple Wikipedia articles in order to increase the reliability of the overall knowledge encoded in the global Wikipedia graph. Computing probabilistic confidence values for every piece of structural information added to the network is an important step in this integration, and it is also meant to provide increased utility for downstream applications. The proposed highly structured semantic network complements existing semantic resources and is expected to have a broad impact on a wide range of natural language processing applications in need of large-scale world knowledge.
For further information, please see the project website: http://lit.csci.unt.edu/index.php/Mu.Se.Net
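To make the intended structure concrete, the following is a minimal sketch, in Python, of one possible way to represent such a network: nodes carrying lexicalizations grouped by language, and edges carrying an ontological relation type together with a probabilistic confidence value. The class names, language codes, and confidence figures are illustrative assumptions, not the project's actual schema or data.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ConceptNode:
    # A node in the network: one entity or concept described in Wikipedia,
    # with its lexicalizations grouped by language code.
    node_id: str
    lexicalizations: Dict[str, List[str]] = field(default_factory=dict)

@dataclass
class Relation:
    # A directed edge capturing an ontological relation between two nodes,
    # together with a probabilistic confidence value for that relation.
    source: str        # the more specific concept, or the part
    target: str        # the more general concept, or the whole
    relation: str      # e.g. "hypernymy" or "meronymy"
    confidence: float  # probability that the relation is correct

# "a book is a publication" (hypernymy) and "a book has chapters" (meronymy)
nodes = [
    ConceptNode("Book", {"en": ["book"], "es": ["libro"], "de": ["Buch"]}),
    ConceptNode("Publication", {"en": ["publication"], "es": ["publicación"]}),
    ConceptNode("Chapter", {"en": ["chapter"], "de": ["Kapitel"]}),
]
edges = [
    Relation("Book", "Publication", "hypernymy", 0.97),
    Relation("Chapter", "Book", "meronymy", 0.91),
]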
In this project, we built semantic analysis tools specifically targeted at the Wikipedia corpus, with the aim of building a large multilingual semantic network in which edges connect entities or concepts that are related to one another through ontological relations such as hypernymy ("a book is a publication") or meronymy ("a book has chapters"). Each node is associated with lexicalizations in different languages, based on the multilingual information present in Wikipedia. The project resulted in several publications, datasets, and software systems, including:

1. A taxonomic relation extraction system and a database of taxonomic relations based on Wikipedia. The system was trained on data extracted from lists and revision histories in Wikipedia, with no manual supervision. The extracted graph database contains over 2 million entity nodes and 3 million relations between pairs of entities.

2. Supervised and semi-supervised learning approaches for multilingual word sense disambiguation and semi-supervised techniques for sense clustering. We explored the cumulative impact of features originating from multiple supporting languages on the task of word sense disambiguation, and built disambiguation systems for several languages. We also addressed the task of sense clustering in Wikipedia, using a rich feature space obtained from multilingual data, and built a system that can automatically determine if two word senses should be merged. (The multilingual feature idea is illustrated in a short sketch below.)

3. An adaptive clustering model for coreference resolution, addressing the task of clustering together nouns and pronouns that refer to the same discourse entity ("it" refers to a "book"). The clustering model improves over the expert rules of a state-of-the-art deterministic system by using the rules as features over pairs of clusters. Statistics from a large web n-gram corpus are used to compute semantic compatibility features (a "book" can "inspire", but a "book" cannot "eat"), leading to improved performance for pronoun resolution. (A sketch of this compatibility signal also appears below.)

All the publications, datasets, and systems are publicly available at http://lit.eecs.umich.edu/research/projects/musenet
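The first sketch below illustrates, under assumed inputs, the multilingual-reinforcement idea behind item 2: the feature vector for an ambiguous word combines context features from the original language with features derived from translated contexts in supporting languages, so that a supervised classifier can draw on evidence from all languages at once. The translated contexts are hand-written examples here; a real system would obtain them from Wikipedia's cross-lingual links or another translation resource.

def context_features(tokens, lang):
    # Simple bag-of-words features, prefixed with the language they came from
    # so that evidence from different languages remains distinguishable.
    return {f"{lang}:{t.lower()}": 1 for t in tokens}

def multilingual_features(context, translations):
    # context: tokens surrounding the ambiguous word in the original language.
    # translations: language code -> tokens of the translated context.
    features = context_features(context, "en")
    for lang, tokens in translations.items():
        features.update(context_features(tokens, lang))
    return features

# Example instance for the ambiguous English word "bank"
feats = multilingual_features(
    ["deposit", "money", "in", "the", "bank"],
    {"es": ["depositar", "dinero", "en", "el", "banco"],
     "de": ["Geld", "auf", "die", "Bank", "einzahlen"]},
)
# 'feats' can then be passed to any standard supervised classifier.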
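The second sketch gives a minimal, assumption-laden view of the semantic compatibility signal described in item 3: counts from a web n-gram corpus can be turned into a pointwise mutual information score indicating how plausible a candidate antecedent is with a pronoun's governing verb. All counts below are invented for illustration; a real system would query the actual n-gram corpus.

import math

# Illustrative, invented counts standing in for a large web n-gram corpus.
pair_counts = {("book", "inspire"): 1200, ("book", "eat"): 3}
word_counts = {"book": 5_000_000, "inspire": 800_000, "eat": 2_500_000}
total = 10 ** 12  # assumed corpus size

def compatibility(noun, verb):
    # Pointwise mutual information of the noun-verb pair, with add-one
    # smoothing; higher values indicate greater semantic compatibility.
    p_pair = (pair_counts.get((noun, verb), 0) + 1) / total
    p_noun = word_counts[noun] / total
    p_verb = word_counts[verb] / total
    return math.log(p_pair / (p_noun * p_verb))

print(compatibility("book", "inspire"))  # comparatively high
print(compatibility("book", "eat"))      # much lower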