III: Small: Collaborative Research: Building a Large Multilingual Semantic Network for Text Processing Applications

Tarau, Paul; Mihalcea, Rada

Abstract

This project is devoted to building a large multilingual semantic network through the application of novel techniques for semantic analysis specifically targeted at the Wikipedia corpus. The driving hypothesis of the project is that the structure of Wikipedia can be effectively used to create a highly structured graph of world knowledge in which nodes correspond to entities and concepts described in Wikipedia, while edges capture ontological relations such as hypernymy and meronymy. Special emphasis is given to exploiting the multilingual information available in Wikipedia in order to improve the performance of each semantic analysis tool. Significant research effort is therefore aimed at developing tools for word sense disambiguation, reference resolution and the extraction of ontological relations that use multilingual reinforcement and the consistent structure and focused content of Wikipedia to solve these tasks accurately. An additional research challenge is the effective integration of inherently noisy evidence from multiple Wikipedia articles in order to increase the reliability of the overall knowledge encoded in the global Wikipedia graph. Computing probabilistic confidence values for every piece of structural information added to the network is an important step in this integration, and it is also meant to provide increased utility for downstream applications. The proposed highly structured semantic network complements existing semantic resources and is expected to have a broad impact on a wide range of natural language processing applications in need of large scale world knowledge.
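The abstract does not state how the probabilistic confidence values are computed. Purely as an illustration of integrating noisy evidence from several articles, the sketch below combines per-article confidences for one candidate relation with a noisy-OR rule. The function name `noisy_or` and the assumption that articles provide independent evidence are ours, not the project's.

```python
from math import prod


def noisy_or(confidences):
    """Combine per-article confidences for a single candidate relation.

    Assumption (not from the project): each article is an independent
    piece of evidence, so the relation is wrong only if every article
    supporting it is wrong.
    """
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
    # Probability that at least one piece of evidence is correct.
    return 1.0 - prod(1.0 - c for c in confidences)


# Hypothetical confidences for the same relation seen in two articles.
combined = noisy_or([0.5, 0.5])
```

Under this rule, agreement across articles raises the combined confidence above any single article's score, and a relation with no supporting evidence gets a confidence of zero.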

For further information, please see the project website: http://lit.csci.unt.edu/index.php/Mu.Se.Net

Project Report

In this project, we built semantic analysis tools specifically targeted at the Wikipedia corpus, with the aim of building a large multilingual semantic network in which edges connect entities or concepts that are related to one another through ontological relations such as hypernymy ("a book is a publication") or meronymy ("a book has chapters"). Each node is associated with lexicalizations in different languages, based on the multilingual information present in Wikipedia. The project resulted in several publications, datasets, and software systems, including:

1. A taxonomic relation extraction system and a database of taxonomic relations based on Wikipedia. The system was trained on data extracted from lists and revision histories in Wikipedia, with no manual supervision. The extracted graph database contains over 2 million entity nodes and 3 million relations between pairs of entities.

2. Supervised and semi-supervised learning approaches for multilingual word sense disambiguation and semi-supervised techniques for sense clustering. We explored the cumulative impact of features originating from multiple supporting languages on the task of word sense disambiguation, and built disambiguation systems for several languages. We also addressed the task of sense clustering in Wikipedia, using a rich feature space obtained from multilingual data, and built a system that can automatically determine if two word senses should be merged.

3. An adaptive clustering model for coreference resolution, addressing the task of clustering together nouns and pronouns that refer to the same discourse entity ("it" refers to a "book"). The clustering model improves over the expert rules of a state-of-the-art deterministic system by using the rules as features over pairs of clusters. Statistics from a large web n-gram corpus are used to compute semantic compatibility features (a "book" can "inspire", but a "book" cannot "eat"), leading to improved performance for pronoun resolution.

All the publications, datasets, and systems are publicly available at http://lit.eecs.umich.edu/research/projects/musenet
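
The kind of network described in this report (concept nodes with lexicalizations in several languages, joined by typed ontological relations) can be pictured with a small data structure. This is a minimal sketch only: the class names `Concept`, `Relation`, and `SemanticNetwork`, their methods, and the French lexicalization are hypothetical and do not reflect the released datasets or software.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    """A node: one entity or concept with per-language lexicalizations."""
    concept_id: str
    lexicalizations: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A directed, typed edge, e.g. book --hypernym--> publication."""
    source: str
    target: str
    kind: str  # "hypernym" or "meronym" in this sketch


class SemanticNetwork:
    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}
        self.relations: Set[Relation] = set()

    def add_concept(self, concept_id: str, **lexicalizations: str) -> None:
        # Keyword arguments map a language code to a surface form.
        self.concepts[concept_id] = Concept(concept_id, dict(lexicalizations))

    def add_relation(self, source: str, target: str, kind: str) -> None:
        # Both endpoints must already exist as nodes.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("both endpoints must be added before the relation")
        self.relations.add(Relation(source, target, kind))

    def hypernyms(self, concept_id: str) -> List[str]:
        # Direct hypernyms only, sorted for a stable result.
        return sorted(
            r.target
            for r in self.relations
            if r.source == concept_id and r.kind == "hypernym"
        )


# The report's own example: "a book is a publication".
net = SemanticNetwork()
net.add_concept("book", en="book", fr="livre")
net.add_concept("publication", en="publication")
net.add_relation("book", "publication", "hypernym")
```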

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1018613
Program Officer: Sylvia Spengler

Project Start
Project End
Budget Start: 2010-09-15
Budget End: 2014-08-31
Support Year
Fiscal Year: 2010
Total Cost: $275,336
Indirect Cost

Institution: University of North Texas, Denton, TX, United States
