TCS NLP Winter School 2008

24 December, 2007 - 7 January, 2008

Collocated with IJCNLP 2008 at IIIT, Hyderabad, India


  1. POS-tagged and chunked corpus (for Hindi)
    • 70,000 words.
      You can access the annotation guidelines and tagset HERE
    • 200,000 words are available in an older tagset; these can be converted to the newer tagset.
      You can access the annotation guidelines HERE

  2. Dependency Treebank
    • 2,000 manually parsed sentences, labeled with Paninian labels.

  3. Parallel Corpus
    • We currently have a clean dataset of 52,000 sentence pairs and are trying to collect more.
    • We have acquired a large collection of parallel books. We are working to clean and sentence-align it, and hope to make it available to the participants of the workshop.
    • 5,000 sentences that have been word-aligned.

  4. Dictionaries
    • A dictionary of 25,000 words. It can be accessed HERE
    • A multi-word dictionary containing 25,000 words.

  5. Named entity corpus (for Hindi)

  6. Hindi Wordnet