CRI: CRD: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu

Palmer, Martha; Xue, Nianwen

Abstract

Treebanks are corpora of naturally occurring text that have been annotated with morphological and syntactic (structural) information. In the last 15 years they have led to significant advances in natural language processing (NLP) by providing training data for supervised machine learning algorithms. These algorithms can now automatically perform useful part-of-speech tagging, parsing and semantic interpretation.

This project is creating a new-generation, multi-representational Treebank. The languages being annotated are Hindi (400K words) and Urdu (200K words). The texts are being annotated in dependency structure (trees in which all nodes are labeled with words of the sentence), enriched with additional semantic role labels. The dependency representation is also being automatically mapped to a phrase-structure representation (in which the words are at the leaves of the tree and internal nodes are labeled with phrase markers). After standard quality control has been applied, both versions will be released to the public, providing an immediate boost to the performance of Hindi/Urdu NLP.

A tool will also be released that allows a researcher to produce alternative formattings of the phrase-structure representation. This supports a view of the treebank as a more general, abstract representation of the morphology and syntax of the language, rather than merely as data for a particular style of machine learning experiment. Research into parsing and other NLP tasks has recently recognized the benefits of reformatting syntactic representations in order to improve the machine learning process; this treebank will make that step much easier for all NLP researchers interested in Hindi or Urdu in particular and in language in general.

OISE is co-funding the University of Colorado student exchange with the IIIT in Hyderabad, India, where 400K words of Hindi and 200K words of Urdu will be annotated with dependency parses. This will enable an international research experience for U.S. students.
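The difference between the two representations described above can be illustrated with a minimal Python sketch. It is not the project's conversion procedure: the `Token` class, the `project` function, the relation labels ("subj", "obj", "mod", "aux") and the placeholder phrase marker `XP` are all hypothetical, and the romanized toy sentence "raam laal seb khaataa hai" ("Ram eats a red apple") is not taken from the treebank. The sketch stores a dependency tree as words with head pointers, then naively projects each head into a bracketed phrase so that the words end up at the leaves.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int   # 1-based position of the word in the sentence
    form: str    # the word itself
    head: int    # index of the governing word, 0 for the root
    label: str   # illustrative relation label (hypothetical tagset)


def project(tokens, tok):
    """Return a bracketed phrase for tok and every word it governs.

    A word with no dependents stays a bare leaf; a word with dependents
    projects a phrase (placeholder marker "XP") that keeps surface order.
    """
    deps = [t for t in tokens if t.head == tok.index]
    if not deps:
        return tok.form
    parts = sorted(deps + [tok], key=lambda t: t.index)
    inner = " ".join(
        p.form if p is tok else project(tokens, p) for p in parts
    )
    return f"(XP {inner})"


# Dependency structure: every node is a word, linked to its head.
tokens = [
    Token(1, "raam", 4, "subj"),
    Token(2, "laal", 3, "mod"),
    Token(3, "seb", 4, "obj"),
    Token(4, "khaataa", 0, "root"),
    Token(5, "hai", 4, "aux"),
]

root = next(t for t in tokens if t.head == 0)

# Phrase structure: words at the leaves, phrase markers on internal nodes.
bracketed = project(tokens, root)
print(bracketed)  # (XP raam (XP laal seb) khaataa hai)
```

A real conversion would need to choose meaningful phrase markers and handle phenomena such as empty categories and non-projective dependencies, which this sketch ignores.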

Project Report

This project concerns the creation of a Hindi/Urdu multi-representational and multi-layered treebank. Automatic syntactic parsing is a key component of modern natural language processing systems. It has contributed to significant improvements in Machine Translation (as in Google Language Tools), Question Answering (as in IBM's Watson system), and Information Extraction, which is currently being used for mining consumer opinions about products, everything from books and movies to clothes and electronic gadgets.

Two different linguistic theories of syntax are widely used for creating the training data (Treebanks) from which supervised machine learning systems learn to parse: dependency structure and phrase structure. Our "multi-representational" Treebank uses both dependency and phrase structure for syntactic representation, so that both types of parsers can be trained on our data. We also provide predicate-argument structure annotation (PropBank) so that semantic role labeling systems can be trained to make explicit the participants in an event. In other words, "Who" did "what" to "whom," "when," "where," and "how?"

We now have a 425,000 word Treebank for Hindi text with dependency structure annotation, and are wrapping up the PropBank annotation. Once that is finished, a conversion procedure will automatically produce the phrase structure annotation. We are doing the same thing for 200,000 words of Urdu. Transliteration between Hindi and Urdu will allow the two treebanks to be used together for either Hindi or Urdu.

A key element of the approach is a commitment to automatic conversion from the manually annotated dependency structure treebank to the phrase structure treebank. To ensure successful conversion from dependency structure (DS) to phrase structure (PS), the guidelines for Hindi and Urdu dependency structure, phrase structure, and PropBank (PB) have been carefully synchronized (Bhatt et al., 2011). This has fostered many in-depth discussions about various linguistic phenomena, and led to a much deeper understanding of the similarities and differences between dependency structure, phrase structure and predicate-argument structure. It also caused a delay in the original schedule of deliverables, but fortunately additional supplemental funding ($93,000) was received in 2012 to continue the annotation, in particular the annotation of Urdu. Since the PropBank annotation of necessity comes after the treebanking is completed, the PropBanking is still not finished. However, IIIT and Colorado will continue working on the Hindi and Urdu PropBanking to try to complete the annotation. The Urdu PropBank Frame Files for the PropBanking are almost complete.

The Hindi Treebank data was released to the research community on July 6, 2014. Over 35 publications have arisen from this effort. The treebank has already been used to train state-of-the-art syntactic parsers for Hindi, which is a major advance in information processing for Hindi. Access to documents and web pages in Hindi and Urdu can spur economic development and foster cultural exchanges, with a far-reaching positive impact outside of the computational linguistics community.

During the course of the project there have been several preliminary data releases that were made available to participants in shared tasks. These included the following:
Year | Workshop / Conference | Data amount
2010 | ICON 2010 Shared Task | 150K
2010 | South Asian Syntax and Semantics, Amherst, MA | 150K
2012 | Machine Translation and Parsing In Indian Languages (MTPIL), COLING 2012 |
2012 | COLING Tutorial: New Frontiers in Hindi and Urdu Natural Language Processing |

The 425K Hindi Treebank has been made publicly available for download at http://ltrc.iiit.ac.in/treebank_H2014/, and the following announcement has been circulated:

From: portal@aclweb.org
Subject: [ACL Member Portal] Pre-release of Hindi Dependency Treebank
Date: July 6, 2014 at 8:42:36 PM GMT+2
To: martha.palmer@colorado.edu

We are making available to researchers a 425K word Hindi Dependency Treebank. This project was funded by NSF CISE-CRI CNS 0751202/0709167: Collaborative Research: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu. The grant investigators include Martha Palmer, Dipti Sharma, Rajesh Bhatt, Owen Rambow and Fei Xia. All of the annotation of the Hindi Treebank being released now was done at IIIT-Hyderabad under the leadership of Dipti Sharma.

The goal has been to develop Hindi and Urdu multi-representational and multi-layered treebanks that include both dependency and phrase structure as syntactic representations, and both Paninian and PropBank style semantic role labels as semantic representations. The guidelines for the dependency structure annotation have been synchronized with the phrase structure guidelines to facilitate automatic conversion. The PropBank guidelines have been extended to include elements that help guide the conversion to phrase structure. The Urdu data with its annotations, and the additional layers and representations for Hindi, will also be released when they are completed.

The pre-release version of the Hindi Dependency Treebank is available for download. The link for downloading the data is http://ltrc.iiit.ac.in/treebank_H2014/

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0751202
Program Officer: Tatiana D. Korelsky

Project Start
Project End
Budget Start: 2008-05-01
Budget End: 2014-04-30
Support Year
Fiscal Year: 2007
Total Cost: $733,029
Indirect Cost

Institution

University of Colorado at Boulder, Boulder, CO, United States