About the Project

The goal of the Hindi-Urdu Treebank (HUTB) project is to build a multi-representational and multi-layered treebank for Hindi and Urdu. The treebank is "multi-representational" in that it uses both dependency and phrase structure analyses for syntactic representation. It is "multi-layered" in that it has separate layers of representation, covering both syntax and lexical predicate-argument structure.

While multi-layered representations have become common, they are usually the combination of different projects at different points in time, each of which adds information to an existing resource. Multi-representational treebanks are less common. Phrase structure treebanks are often converted to dependency in order to train dependency parsers (and occasionally vice versa), but the conversion usually yields a treebank that is itself undocumented. The meaning of the dependency representation can only be inferred from the algorithm used to derive it from the phrase structure representation; unlike the phrase structure representation, it has no independent linguistic motivation or documentation.

Having both a dependency treebank and a phrase structure treebank for the same data enhances the utility of the resource. For example, in the development of parsers, it is becoming increasingly clear that the choice of syntactic representation for a language is itself a research question in parsing.

Dependency annotation is the first layer of the multi-representational and multi-layered treebank. It is based on the Paninian grammatical framework.

The project is a collaborative effort of five universities in two countries:

University of Colorado Boulder
Columbia University
University of Massachusetts at Amherst (UMass)
University of Washington (UW)
International Institute of Information Technology (IIIT) in Hyderabad, India.

This project was funded by NSF CISE-CRI CNS 0751202/0709167: Collaborative Research: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu.

A pre-release version of the Hindi Dependency Treebank data is available for download.

Please note that only dependency-annotated data is being released at this time.

Please register to download the data.


For updates on treebank data:

For discussion on treebank data: