TCS NLP Winter School 2008

24 December, 2007 - 7 January, 2008

Collocated with IJCNLP 2008 at IIIT, Hyderabad, India




Morphological Analysis

Morphological analysis is an important step in processing Indian languages. In this project, the goal will be to develop and test morphological analyzers for Indian languages. A range of techniques will be tried in order to develop both rule-based and unsupervised analyzers.


Comparison of an Unsupervised Morph Analyzer with a Rule-based Morph Analyzer

The goal of this project is to build an unsupervised morphological analyzer and compare its output with the analysis produced by a rule-based morphological analyzer. The comparison will consider aspects such as accuracy and coverage.
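One way such a comparison could be scored is sketched below. It assumes each analyzer can be wrapped as a function from a word to a set of (root, features) analyses, and that a small hand-annotated gold set is available. The function names, the analysis representation and the romanized toy data are all hypothetical and are not part of the project resources.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Analysis = Tuple[str, str]            # (root, feature string): an assumed representation
Analyzer = Callable[[str], Set[Analysis]]


def coverage(analyzer: Analyzer, words: Iterable[str]) -> float:
    """Fraction of word tokens that receive at least one analysis."""
    words = list(words)
    if not words:
        return 0.0
    analyzed = sum(1 for w in words if analyzer(w))
    return analyzed / len(words)


def accuracy(analyzer: Analyzer, gold: Dict[str, Set[Analysis]]) -> float:
    """Fraction of gold words for which some returned analysis is in the gold set."""
    if not gold:
        return 0.0
    correct = sum(1 for w, g in gold.items() if analyzer(w) & g)
    return correct / len(gold)


# Toy stand-ins for the two analyzers (hypothetical data, for illustration only).
RULE_BASED = {
    "ladZake": {("ladZakA", "n.m.pl")},
    "ladZakA": {("ladZakA", "n.m.sg")},
}
UNSUPERVISED = {
    "ladZake": {("ladZak", "e")},
    "ladZakA": {("ladZak", "A")},
    "Gara": {("Gar", "a")},
}


def rule_based(word: str) -> Set[Analysis]:
    return RULE_BASED.get(word, set())


def unsupervised(word: str) -> Set[Analysis]:
    return UNSUPERVISED.get(word, set())


corpus = ["ladZake", "ladZakA", "Gara", "kiwAba"]
gold = {"ladZake": {("ladZakA", "n.m.pl")}}

print("coverage:", coverage(rule_based, corpus), coverage(unsupervised, corpus))
print("accuracy:", accuracy(rule_based, gold), accuracy(unsupervised, gold))
```

Note that the two analyzers need not share an output format: an unsupervised analyzer typically produces stem and suffix segmentations rather than roots and feature labels, so a real comparison would first need a mapping between the two before accuracy can be measured against a single gold set.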

Guide: Srinivas Bangalore (AT&T Research Labs)
Mentors: Sriram Venkatapathy (IIIT-H)

Team
Parminder Singh,
Gurunanak Univ., Punjab
N Kalyani,
G Narayanamma Institute,
AnuSys-11
Ankur Garg,
CDAC-Noida,
AnuSys-12
K V N Sunita,
G Narayanamma Institute,
AnuSys-11

Team
Balaram Prasain,
Tribhuvan University,
AnuSys-7
Rajeev R R,
Tamil University
Asanka Wasala,
University of Colombo,
AnuSys-8
Pramod Gupta,
CDAC-Noida,
AnuSys-7

Resources :
  • An open-source rule-based morphological analyzer (follows the paradigm approach)
  • A clean CIIL Hindi corpus of 1.2 million words
Reading Assignments:
  1. Utpal Sharma, Jugal Kalita and Rajib Das. 2002. Unsupervised Learning of Morphology for Building Lexicon for a Highly Inflectional Language. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.
  2. Yu Hu, Irina Matveeva, John Goldsmith and Colin Sprague. 2005. Using Morphology and Syntax Together in Unsupervised Learning. In Proceedings of the Workshop on Psychocomputational Models of Human Language Acquisition.
  3. Poor Man's Stemming: Unsupervised Recognition of Same-Stem Words.
Additional papers to read:
  1. Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. Natural Language Processing: A Paninian Perspective
List of Experiments to be performed :
  1. To be finalized

 

Semi-supervised Morphological Analysis Using a Rule-based System as a Seed

The goal of this project is to improve a Hindi rule-based morphological analyzer using a raw corpus. Rule-based morphological analyzers tend to have fairly low coverage. We will try to improve the analyzer's coverage using a large Hindi corpus.
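A minimal sketch of one plausible way to do this follows. It is not the project's prescribed method: it assumes the seed analyzer's knowledge can be reduced to a set of known stems plus paradigms given as suffix sets, and it proposes a new lexicon entry when a stem missing from the seed lexicon is attested in the raw corpus with several suffixes of the same paradigm. The paradigm names, suffixes and romanized words are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Hypothetical seed paradigms: paradigm name -> suffixes it licenses.
PARADIGMS: Dict[str, Set[str]] = {
    "noun_A": {"A", "e", "oM"},
    "noun_I": {"I", "iyAz", "iyoM"},
}


def propose_entries(corpus_words: Iterable[str],
                    known_stems: Set[str],
                    min_forms: int = 2) -> Dict[str, str]:
    """Propose stem -> paradigm for stems that the seed lexicon lacks.

    A stem is proposed when at least `min_forms` distinct suffixes of one
    paradigm are attested with it in the raw corpus.
    """
    vocab = set(corpus_words)
    # (stem, paradigm) -> suffixes of that paradigm seen attached to the stem
    evidence = defaultdict(set)
    for word in vocab:
        for name, suffixes in PARADIGMS.items():
            for suffix in suffixes:
                if word.endswith(suffix) and len(word) > len(suffix):
                    stem = word[: -len(suffix)]
                    if stem not in known_stems:
                        evidence[(stem, name)].add(suffix)

    proposals: Dict[str, str] = {}
    best_count: Dict[str, int] = {}
    for (stem, name), seen in evidence.items():
        # Keep the paradigm with the most attested suffixes for each stem.
        if len(seen) >= min_forms and len(seen) > best_count.get(stem, 0):
            proposals[stem] = name
            best_count[stem] = len(seen)
    return proposals


corpus = ["kamarA", "kamare", "kamaroM", "ladZakI", "ladZakiyAz", "pAnI"]
print(propose_entries(corpus, known_stems={"ladZak"}))
```

In this toy run only the stem seen with three suffixes of one paradigm is proposed; a stem seen with a single suffix is left out because the evidence is too thin. Raising `min_forms` trades coverage gains for fewer spurious entries.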

Guide: Srinivas Bangalore (AT&T Research Labs)
Mentors: Sriram Venkatapathy (IIIT-H)

Team 1
Viraj Welgama,
University of Colombo,
AnuSys-22
Prateek Bhatia,
Thapar University, Patiala.
-Vasudevan,
IIT-Bombay,
AnuSys-21

Team 2
Vishal Goyal,
Punjab Univ.,
AnuSys-19
 -
D V Sriram,
IIIT-Hyderabad,
AnuSys-16
Krishna Kumar,
Tamil University

Resources:
  • A rule-based morphological analyzer for Hindi.
  • A clean CIIL Hindi corpus of 1.2 million words
Reading Assignment:
  1. Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. Natural Language Processing: A Paninian Perspective
  2. Akshar Bharati, Rajeev Sangal, Sushma Bendre, Pavan Kumar, Aishwarya. Unsupervised Improvement of Morphological Analyzer for Inflectionally Rich Languages.
Additional papers to read :
  1. Utpal Sharma, Jugal Kalita and Rajib Das. 2002. Unsupervised Learning of Morphology for Building Lexicon for a Highly Inflectional Language. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning.
 

Comparison of FST tools for Morphological Analysis

Guide: Amba Kulkarni (University of Hyd)
Mentors:

Team
Ashwini Vaidya,
IIIT-Hyderabad,
Sys-12
Renjini Narendranath,
IIIT-Hyderabad,
Sys-12
Gowri Dev,
IIIT-Hyderabad,
Sys-15
Thennarasu Sakkan,
University of Hyderabad,
Sys-15

Resources:
Reading Assignments:
  1. Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. Natural Language Processing: A Paninian Perspective (Chapter on Morphological Analysis)
  2. APERTIUM Documentation
  3. FLAN Documentation
List of Experiments to be performed :
  1. Run FLAN as well as Apertium for the existing paradigms
  2. Compare the performance of the original morph analyzer, FLAN and Apertium on the following parameters:
    • Performance on random texts
      • execution time
      • coverage
    • Ease or difficulty of adapting the FSTs to handle 'vowel harmony' (as in Telugu/Marathi) and derivational morphology (Telugu).
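The execution time and coverage measurements could be collected with a small harness along the following lines. This is a sketch under assumptions: each tool is taken to be wrapped in a Python callable that maps a word to a list of analyses, and the stand-in analyzers below are hypothetical. No real FLAN or Apertium interface is shown or implied.

```python
import time
from typing import Callable, Dict, List, Sequence

Analyzer = Callable[[str], List[str]]   # word -> analyses (empty list if unknown)


def benchmark(analyzers: Dict[str, Analyzer],
              words: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Run every analyzer over the same word list; record wall time and coverage."""
    results: Dict[str, Dict[str, float]] = {}
    for name, analyze in analyzers.items():
        start = time.perf_counter()
        outputs = [analyze(w) for w in words]
        elapsed = time.perf_counter() - start
        covered = sum(1 for out in outputs if out)
        results[name] = {
            "seconds": elapsed,
            "coverage": covered / len(words) if words else 0.0,
        }
    return results


# Hypothetical stand-ins for the wrapped tools.
def tool_a(word: str) -> List[str]:
    return ["analysis"] if word.endswith("A") else []


def tool_b(word: str) -> List[str]:
    return ["analysis"]


words = ["ladZakA", "kamare", "pAnI", "kamarA"]
for name, stats in benchmark({"tool_a": tool_a, "tool_b": tool_b}, words).items():
    print(name, stats)
```

If the tools are run as external programs, process start-up cost can dominate the timing on short inputs, so it is fairer to pass each tool the whole text in one invocation and to time all tools on the same machine and the same random sample.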