CRI: CRD: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu

Xia, Fei

Abstract

Treebanks are corpora of naturally occurring text that have been annotated with morphological and syntactic (structural) information. In the last 15 years they have led to significant advances in natural language processing (NLP) by providing training data for supervised machine learning algorithms. These algorithms can now automatically perform useful part-of-speech tagging, parsing and semantic interpretation. This project is creating a new-generation, multi-representational treebank. The languages being annotated are Hindi (400K words) and Urdu (200K words). The texts are being annotated in dependency structure (trees in which all nodes are labeled with words of the sentence), enriched with additional semantic role labels. The dependency representation is also being automatically mapped to a phrase-structure representation (in which the words are at the leaves of the tree and internal nodes are labeled with phrase markers). After standard quality control has been applied, both versions will be released to the public, providing an immediate boost to the performance of Hindi/Urdu NLP. A tool will also be released that allows a researcher to produce alternative formattings of the phrase-structure representation. This supports a view of the treebank as a more general, abstract representation of the morphology and syntax of the language rather than merely as data for a particular style of machine learning experiment. Research into parsing and other NLP tasks has recently recognized the benefits of reformatting syntactic representations in order to improve the machine learning process; this treebank will make that step much easier for all NLP researchers interested in Hindi or Urdu in particular and in language in general.
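The difference between the two representations can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the `Token` class, the `to_brackets` function, the one-letter tag set, the relation labels, and the English example sentence are all hypothetical, and the naive rule used here (every word with dependents projects one phrase node) is not the project's conversion procedure.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # the word itself
    pos: str    # part-of-speech tag (illustrative tag set)
    head: int   # index of the governing word, 0 for the root
    label: str  # dependency relation (illustrative labels)


def to_brackets(tokens, idx):
    """Naively project each word that has dependents into a phrase node."""
    token = next(t for t in tokens if t.idx == idx)
    leaf = f"({token.pos} {token.form})"
    children = [t for t in tokens if t.head == idx]
    if not children:
        return leaf
    # Keep surface word order: interleave the head with its dependents.
    parts = {idx: leaf}
    for child in children:
        parts[child.idx] = to_brackets(tokens, child.idx)
    ordered = " ".join(parts[i] for i in sorted(parts))
    return f"({token.pos}P {ordered})"


# Dependency structure: every node is a word of the sentence.
sentence = [
    Token(1, "Ram", "N", 2, "subj"),
    Token(2, "ate", "V", 0, "root"),
    Token(3, "an", "D", 4, "det"),
    Token(4, "apple", "N", 2, "obj"),
]

root = next(t for t in sentence if t.head == 0)
# Phrase structure: words at the leaves, phrase labels on internal nodes.
tree = to_brackets(sentence, root.idx)
print(tree)  # (VP (N Ram) (V ate) (NP (D an) (N apple)))
```

In the dependency form each word simply points at its head, while the bracketed output adds internal nodes (`VP`, `NP`) that carry phrase labels. A real conversion has to make linguistically motivated decisions that this sketch ignores.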

Project Report

This project concerns the creation of a Hindi/Urdu multi-representational and multi-layered treebank. Automatic syntactic parsing is a key component of modern natural language processing systems, and it has contributed to significant improvements in machine translation (as in Google Language Tools), question answering (e.g., IBM's Watson system), and information extraction (e.g., mining consumer opinions about products and issues). Automatic parsers are usually created using supervised machine learning; the training data is a treebank, which is naturally occurring text with manual syntactic annotations. Two different linguistic theories of how to represent syntax are widely used for creating treebanks: dependency structure and phrase structure. Our "multi-representational" treebank uses both dependency and phrase structure for syntactic representation; as a result, both types of parsers can be trained on our data. We also provide predicate-argument structure annotation (PropBank), which makes explicit the participants in an event, so that semantic role labeling systems can also be trained. In other words, "who" did "what" to "whom," "when," "where," and "how"? We now have a 425,000-word treebank for Hindi text with dependency structure annotation, and are wrapping up the PropBank annotation. Once that is finished, a conversion procedure will automatically produce the phrase structure annotation. We are doing the same for 200,000 words of Urdu. Transliteration between Hindi and Urdu will allow the two treebanks to be used together for either Hindi or Urdu. A key element of the approach is a commitment to automatic conversion from the manual dependency structure treebank to the phrase structure treebank.
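The "who did what to whom" framing of predicate-argument annotation can be sketched in a few lines of Python. The role names follow the general PropBank style (numbered arguments plus `ARGM-` modifiers), but the `ROLE_TO_QUESTION` table, the `roles_to_answers` function, and the English example are hypothetical. In particular, the meaning of numbered arguments depends on the verb, so the fixed mapping below is a simplification that happens to fit a verb like "give"; it is not the project's annotation scheme for Hindi or Urdu.

```python
# Simplified mapping from PropBank-style role labels to question words.
# Assumption: numbered arguments are read as for a "give"-type verb.
ROLE_TO_QUESTION = {
    "ARG0": "who",
    "ARG1": "what",
    "ARG2": "to whom",
    "ARGM-TMP": "when",
    "ARGM-LOC": "where",
    "ARGM-MNR": "how",
}


def roles_to_answers(arguments):
    """Turn labeled arguments of one predicate into question/answer pairs."""
    return {
        ROLE_TO_QUESTION[role]: text
        for role, text in arguments.items()
        if role in ROLE_TO_QUESTION
    }


# Hypothetical annotation for "Mary gave John a book yesterday in Delhi."
frame = {
    "predicate": "gave",
    "arguments": {
        "ARG0": "Mary",
        "ARG1": "a book",
        "ARG2": "John",
        "ARGM-TMP": "yesterday",
        "ARGM-LOC": "in Delhi",
    },
}

answers = roles_to_answers(frame["arguments"])
for question, text in answers.items():
    print(f"{question}: {text}")
```

A semantic role labeling system trained on such annotation would predict the role labels automatically; the sketch only shows how labeled arguments relate to the questions listed above.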
To ensure successful conversion from dependency structure (DS) to phrase structure (PS), the guidelines for Hindi and Urdu dependency structure, phrase structure, and PropBank (PB) have been carefully synchronized (Bhatt et al., 2011). This has fostered many in-depth discussions about various linguistic phenomena, and led to a much deeper understanding of the similarities and differences between dependency structure, phrase structure and predicate-argument structure. It also caused a delay in the original schedule of deliverables, but fortunately additional supplemental funding ($93,000) was received in 2012 to continue the annotation, in particular the annotation of Urdu. Since the PropBank annotation necessarily comes after the treebanking is completed, the PropBanking is still not finished. Nevertheless, IIIT and Colorado will continue working on the Hindi and Urdu PropBanking to try to complete the annotation. The Urdu PropBank Frame Files for the PropBanking are almost complete. The Hindi Treebank data was released to the research community on July 6, 2014. Over 35 publications have arisen from this effort. The treebank has already been used to train state-of-the-art syntactic parsers for Hindi, which is a major advance in information processing for Hindi. Access to documents and web pages in Hindi and Urdu can spur economic development and foster cultural exchanges, having a far-reaching positive impact outside of the computational linguistics community. During the course of the project, several preliminary data releases were made available to participants in shared tasks. These included the following.
Year | Workshop | Conference | Data amount
2010 | South Asian Syntax and Semantics, Amherst, MA | ICON 2010 Shared Task | 150K
2012 | Machine Translation and Parsing in Indian Languages | COLING Tutorial "New Frontiers in Hindi and Urdu Natural Language Processing" | 295K

The 425K-word Hindi Treebank has been made publicly available for download at http://ltrc.iiit.ac.in/treebank_H2014/, and the following announcement has been circulated:

From: portal@aclweb.org
Subject: [ACL Member Portal] Pre-release of Hindi Dependency Treebank
Date: July 6, 2014 at 8:42:36 PM GMT+2

We are making available to researchers a 425K word Hindi Dependency Treebank. This project was funded by NSF CISE-CRI CNS 0751202/0709167: Collaborative Research: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu. The grant investigators include Martha Palmer, Dipti Sharma, Rajesh Bhatt, Owen Rambow and Fei Xia. All of the annotation of the Hindi Treebank being released now was done at IIIT-Hyderabad under the leadership of Dipti Sharma.

The goal has been to develop Hindi and Urdu multi-representational and multi-layered treebanks that include both dependency and phrase structure as syntactic representations, and both Paninian and PropBank style semantic role labels as semantic representations. The guidelines for the dependency structure annotation have been synchronized with the phrase structure guidelines to facilitate automatic conversion. The PropBank guidelines have been extended to include elements that help guide the conversion to phrase structure. The Urdu data with its annotations, and the additional layers and representations for Hindi, will also be released when they are completed.

The pre-release version of the Hindi Dependency Treebank is available for download. The link for downloading the data is http://ltrc.iiit.ac.in/treebank_H2014/.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0751213
Program Officer: Tatiana D. Korelsky

Budget Start: 2008-05-01
Budget End: 2014-04-30
Fiscal Year: 2007
Total Cost: $196,000

Institution

University of Washington, Seattle, WA, United States
