Treebanks are corpora of naturally occurring text that have been annotated with morphological and syntactic (structural) information. In the last 15 years they have led to significant advances in natural language processing (NLP) by providing training data for supervised machine learning algorithms. These algorithms can now automatically perform useful part-of-speech tagging, parsing, and semantic interpretation. This project is creating a new-generation, multi-representational treebank. The languages being annotated are Hindi (400K words) and Urdu (200K words). The texts are being annotated in dependency structure (trees in which all nodes are labeled with words of the sentence), enriched with additional semantic role labels. The dependency representation is also being automatically mapped to a phrase-structure representation (in which the words are at the leaves of the tree and internal nodes are labeled with phrase markers). After standard quality control has been applied, both versions will be released to the public, providing an immediate boost to the performance of Hindi/Urdu NLP. A tool will also be released that allows a researcher to produce alternative formattings of the phrase-structure representation. This supports a view of the treebank as a more general, abstract representation of the morphology and syntax of the language rather than merely as data for a particular style of machine learning experiment. Research into parsing and other NLP tasks has recently recognized the benefits of reformatting syntactic representations in order to improve the machine learning process; this treebank will make that step much easier for all NLP researchers interested in Hindi or Urdu in particular and in language in general.
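The difference between the two representations described above can be illustrated with a minimal sketch. It is not the project's conversion procedure: the `DepNode` class, the `to_phrase_structure` function, the tags, the relation labels, and the English toy sentence are all assumptions made for illustration, and the mapping rule (each head with dependents projects a phrase named after its part of speech) is deliberately naive.

```python
from dataclasses import dataclass, field


@dataclass
class DepNode:
    """A dependency-tree node: every node is a word of the sentence."""
    word: str
    pos: str      # part-of-speech tag (illustrative tag set, an assumption)
    index: int    # position of the word in the sentence
    children: list = field(default_factory=list)  # (relation, DepNode) pairs


def to_phrase_structure(node):
    """Naively map a dependency tree to a bracketed phrase-structure string.

    Words end up at the leaves; a head that has dependents projects an
    internal node labeled '<POS>P'. This is a toy rule, not the treebank's.
    """
    leaf = f"({node.pos} {node.word})"
    if not node.children:
        return leaf
    # Place the head and its dependents in surface order.
    parts = [(node.index, leaf)]
    for _relation, child in node.children:
        parts.append((child.index, to_phrase_structure(child)))
    parts.sort(key=lambda part: part[0])
    return f"({node.pos}P " + " ".join(text for _, text in parts) + ")"


# Toy sentence: "Ram read the book" (relation labels are hypothetical).
book = DepNode("book", "N", 3, [("det", DepNode("the", "D", 2))])
root = DepNode("read", "V", 1, [("subj", DepNode("Ram", "N", 0)), ("obj", book)])

print(to_phrase_structure(root))
```

In the dependency tree, `read` is itself a node with `Ram` and `book` beneath it; in the bracketed output, all four words sit at the leaves and the internal nodes carry phrase labels. A realistic conversion would need far richer rules than this one.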

Project Report

This project was involved in the creation of a treebank for Hindi and Urdu. Treebanks are collections of texts that have been annotated to make various kinds of useful information explicit. Our treebank marks information about the syntactic structure of its sentences; it also includes information about verb meaning. The project was a collaboration among four groups: the lead PI Martha Palmer at CU Boulder, co-PI Fei Xia at U Washington, co-PI Owen Rambow at Columbia, and co-PI Rajesh Bhatt at UMass Amherst. In addition, there was an international subcontract to a team led by Dipti Sharma at IIIT Hyderabad in India. The team at UMass Amherst developed the guidelines for phrase-structure annotation. These have been completed and are available at http://verbs.colorado.edu/hindiurdu/. All teams except the UMass team filed for a no-cost extension, which will end in April 2014. We will submit a full report as a group then. In preparation for that, we have created a pre-release version that we are making available to interested researchers. Once some additional quality control is done and we have received initial feedback, the whole treebank will be released publicly. We expect this to happen by the time the other teams complete their final extension. Even though the UMass team did not file for an extension, we have remained actively involved in discussions about publications, outreach, and quality control. The resulting treebank will be of use for the development of language technologies for Hindi and Urdu. It will also provide a model for the creation of similar resources for other South Asian languages, none of which currently has a large-scale treebank. The treebank can also be used for linguistic research on Hindi and Urdu. The various guidelines written for the treebank as part of this project provide important documentation of the grammar of these languages.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0751171
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2008-05-01
Budget End: 2012-04-30
Support Year:
Fiscal Year: 2007
Total Cost: $123,001
Indirect Cost:

Name: University of Massachusetts Amherst
Department:
Type:
DUNS #:
City: Amherst
State: MA
Country: United States
Zip Code: 01003