About the Project
The goal of the Hindi-Urdu Treebanking (HUTB) project is to build multi-representational and multi-layered treebanks for Hindi and Urdu. These treebanks are "multi-representational" in the sense that both dependency and phrase structure analyses are used for syntactic representation. They are "multi-layered" in that they contain different layers of representation, covering both syntax and lexical predicate-argument structure. While multi-layered representations have become common, they are usually the combination of different projects at different points in time, each of which adds information to an existing resource.
Multi-representational treebanks are less common. Phrase structure treebanks are often converted to dependency treebanks in order to train dependency parsers (and occasionally vice versa), but the resulting treebank is usually undocumented. The meaning of the dependency representation can only be inferred from the algorithm used to derive it from the phrase structure representation; unlike the phrase structure representation, it has no independent linguistic motivation or documentation. Having both a dependency treebank and a phrase structure treebank for the same data enhances the utility of the resource. For example, in parser development it is becoming increasingly clear that the choice of syntactic representation for a language is itself a research question in parsing.
Under the HUTB project, dependency annotation forms the first layer of syntactic analysis. Phrase structure and PropBank annotations are done on top of the dependency trees. The dependency annotations are based on the Paninian Grammar Framework. See the Related Publications for comprehensive details about the HUTB treebanking project.
The project is a collaborative effort of five universities in two countries:
- University of Colorado Boulder
- Columbia University
- University of Massachusetts at Amherst (UMass)
- University of Washington (UW)
- International Institute of Information Technology (IIIT) in Hyderabad, India
This project was funded by NSF CISE-CRI CNS 0751202/0709167: Collaborative Research: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu.