- POS-tagged and chunked corpus (for Hindi)
- 70,000 words.
You can access the annotation guidelines and tagset HERE
- 200,000 words are available in an older tagset; these can be converted to the newer tagset.
You can access the annotation guidelines HERE
- 2,000 manually parsed sentences, annotated with Paninian labels.
- We currently have a clean dataset of 52,000 sentence pairs and are trying to collect more.
- We have acquired a large collection of parallel books and are working to clean and sentence-align it. We hope to make it available to the workshop participants.
- 5,000 word-aligned sentences.
- Dictionary of 25,000 words. It can be accessed HERE
- Multi-word dictionary containing 25,000 words.
- Named entity corpus (for Hindi)