TCS NLP Winter School 2008

70,000 words.
You can access the annotation guidelines and tagset HERE
200,000 words are available in an older tagset; these can be converted to the newer tagset.
You can access the annotation guidelines HERE

Dependency Treebank
- 2,000 manually parsed sentences labeled with Paninian labels.

Parallel Corpus
- We currently have a clean dataset of 52,000 sentence pairs and are trying to collect more.
- We have acquired a large collection of parallel books, which we are trying to clean and sentence-align. We hope to make it available to the participants of the workshop.
- 5,000 word-aligned sentences.

Dictionaries
- A 25,000-word dictionary. It can be accessed HERE
- A multi-word dictionary containing 25,000 words.