TCS NLP Winter School 2008

24 December, 2007 - 7 January, 2008

Collocated with IJCNLP 2008 at IIIT, Hyderabad, India

Machine Translation (from English to Hindi & from Hindi to English)
The goal of this project is to develop an English-to-Hindi Statistical Machine Translation System. A medium-sized parallel dataset will be provided to the participants to train their systems, and a test set will be provided to measure the performance of the systems. Some of the sub-projects that the participants can pursue are:

Statistical Phrase-based Machine Translation
The goal here will be to tune an existing phrase-based machine translation system to the Indian language setting. Phrase-based systems do not take into account the morphological richness of Indian languages or the word-order variations that exist between English and Indian languages.

In this project, a number of experiments will be conducted to take advantage of the rich morphology of Indian languages within the framework of phrase-based machine translation.
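
As one illustration of how morphological information might be exposed to a phrase-based system, the sketch below writes each token as a surface form, lemma and part-of-speech tag joined by a vertical bar, in the spirit of the factored models described in the Koehn and Hoang reading. The ANALYSES table is a hypothetical stand-in for the output of the morphological analyzer and POS tagger listed under Resources, and the function name and the UNK fallback tag are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Hypothetical analyzer output: lower-cased surface form -> (lemma, POS tag).
# A real run would query the morphological analyzer and POS tagger instead.
ANALYSES: Dict[str, Tuple[str, str]] = {
    "boys": ("boy", "NNS"),
    "are": ("be", "VBP"),
    "playing": ("play", "VBG"),
}


def to_factored(tokens: List[str], analyses: Dict[str, Tuple[str, str]]) -> str:
    """Join surface|lemma|POS for every token.

    Words missing from the table fall back to their lower-cased form
    as the lemma and the placeholder tag UNK.
    """
    factored = []
    for tok in tokens:
        lemma, pos = analyses.get(tok.lower(), (tok.lower(), "UNK"))
        factored.append(f"{tok}|{lemma}|{pos}")
    return " ".join(factored)


line = to_factored("The boys are playing".split(), ANALYSES)
print(line)
```

Keeping the lemma as a separate factor is one way to let a system share statistics across inflected forms of the same word, which matters more as the morphology of the language gets richer.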

Guide: Srinivas Bangalore (AT&T Research Labs, NJ, USA)
Mentors: Sriram Venkatapathy (IIIT-H)

Team 1
Sachin Anklekar, CDAC-Mumbai, Sys-19	Sriram Chaudhary, IIIT-Hyderabad, Sys-Anu-18
Niraj Shreshta, Katmandu University, Sys-19	-

Team 2
Vimal, CDAC-Noida, Sys-20	Sunny Sharma, Delhi University, Sys-20
Tarak Ram, IIIT-Hyderabad, Sys-18	Bindu Madhavi, University of Hyderabad, Sys-18

Resources :

52,000 English-Hindi Sentence Pairs (Refined Dataset)
400,000 English-Hindi Sentence Pairs (Noisy)
MOSES: Open-Source Phrase-based Translation System
Morphological Analyzer for English and Hindi
POS-taggers for English and Hindi

Reading Assignments :

Philipp Koehn, Franz Josef Och, and Daniel Marcu. (2003). Statistical Phrase-Based Translation. HLT/NAACL 2003
Philipp Koehn and Hieu Hoang. Factored Translation Models, Conference on Empirical Methods in Natural Language Processing (EMNLP), Prague, Czech Republic, June 2007.
Web Tutorial on training MOSES
S. Goldwater and D. McClosky. 2005. Improving statistical MT through morphological analysis. In Proceedings of HLT/EMNLP - 2005.

Additional papers to read :

Och, F. J. (2003). Minimum error rate training for statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL).
Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417-450, December.
R. Zens and H. Ney. 2004. Improvements in phrase-based statistical machine translation. In Proceedings of HLT-NAACL, pages 257-264, Boston, MA.

List of Experiments to be performed :

* To be finalized

Syntax-based Machine Translation
Syntax-based approaches are well suited to handling large word-order variations between languages, and hence they seem more appropriate for developing systems between English and Indian languages. The goal of this project will be to extend these approaches to obtain better translation accuracy.

Several experiments will be conducted to evaluate the effectiveness of syntax for sentence re-construction.

Guide: Srinivas Bangalore (AT&T Research Labs, NJ, USA)
Mentors: Sriram Venkatapathy (IIIT-H)

Team 1
Alok Dadhekar, CDAC Mumbai, Sys-21	Garima Kukreja, Delhi University, Sys-21
Avinesh, IIIT-Hyderabad, Sys-23	K. Rajyarama, University of Hyd., Sys-16
Gour Mohan, CDAC-Noida, Sys-16	-

Team 2
Prashanth Mathur, IIIT-Hyderabad, Sys-24	Kolte Sopan Govind, Bharati Vidhyapeeth, Pune, Sys-24
Kailash Kattalay, Fuji Academy, Sys-23	Saurabh Kushwaha, CDAC-Mumbai, Sys-17
Anil Kumar, CDAC-Noida, Sys-17	-

Resources :

52,000 English-Hindi Sentence Pairs (Refined Dataset)
400,000 English-Hindi Sentence Pairs (Noisy)
Wide-coverage parser for English
Limited coverage parser for Hindi
Supertagger for English

Reading Assignments :

Yamada, K. and Knight, K. (2001). A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL).
Menezes, A. and Quirk, C. (2005). Microsoft research treelet translation system: IWSLT evaluation. In Proc. of the International Workshop on Spoken Language Translation.
Sriram Venkatapathy and Srinivas Bangalore. 2007. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation, Rochester, USA.

Additional papers to read :

Alshawi, H., Bangalore, S., and Douglas, S. (1998). Automatic acquisition of hierarchical transduction models for machine translation. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL).
Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).

List of Experiments to be performed :

* To be finalized

Sentence Construction after Global Lexical Selection
Global lexical selection is a recently proposed technique that considers the entire source sentence when predicting each word of the target sentence. The predicted target-language words are then arranged in an appropriate order to obtain a well-formed target sentence. Global lexical selection has been shown to deliver good lexical selection accuracy. In this project, the goal will be to develop well-performing sentence construction algorithms.
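
The two stages described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the method of the papers in the reading list: the WEIGHTS and BIGRAMS tables, the transliterated Hindi words, the threshold and both function names are hypothetical. A real system would learn the word-selection scores with a classifier such as a maximum entropy model and would use a trained language model with a search procedure instead of enumerating permutations.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

# Toy association weights between source and target words (hypothetical).
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("ram", "raam"): 2.0,
    ("eats", "khaataa"): 1.5,
    ("eats", "hai"): 1.2,
    ("mango", "aam"): 2.0,
}

# Toy bigram scores standing in for a target language model (hypothetical).
BIGRAMS: Dict[Tuple[str, str], float] = {
    ("<s>", "raam"): 1.0,
    ("raam", "aam"): 1.0,
    ("aam", "khaataa"): 1.0,
    ("khaataa", "hai"): 1.0,
    ("hai", "</s>"): 1.0,
}


def select_words(source: List[str], threshold: float = 1.0) -> List[str]:
    """Stage 1: score each target word against the whole source sentence
    and keep those whose total weight reaches the threshold."""
    totals: Dict[str, float] = {}
    for (src, tgt), weight in WEIGHTS.items():
        if src in source:
            totals[tgt] = totals.get(tgt, 0.0) + weight
    return sorted(t for t, total in totals.items() if total >= threshold)


def order_words(bag: List[str]) -> List[str]:
    """Stage 2: pick the permutation with the highest bigram score.
    Exhaustive search is only feasible for very short bags of words."""
    def score(seq: Sequence[str]) -> float:
        padded = ["<s>", *seq, "</s>"]
        return sum(BIGRAMS.get(pair, 0.0) for pair in zip(padded, padded[1:]))
    return list(max(permutations(bag), key=score))


bag = select_words(["ram", "eats", "mango"])
sentence = order_words(bag)
print(bag)
print(" ".join(sentence))
```

The first stage returns an unordered bag of target words, so all word-order decisions are left to the second stage, which is the part this project aims to improve.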

Guide: Srinivas Bangalore (AT&T Research Labs, NJ, USA)
Mentors: Sriram Venkatapathy (IIIT-H)

Team
Karthik Gali, IIIT-Hyderabad, Sys-26	Riya Singh, NIT - Surathkal, Sys-26
Latha Nair, CUSAT, Cochin, Sys-25	Vipul Mittal, IIIT-Hyderabad, Sys-25

Resources :

52,000 English-Hindi Sentence Pairs (Refined Dataset)
400,000 English-Hindi Sentence Pairs (Noisy)
English-French Europarl Corpus
Maximum Entropy Toolkit

Reading Assignments :

Srinivas Bangalore, Patrick Haffner and Stephan Kanthak. 2007. Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic.
Sriram Venkatapathy and Srinivas Bangalore. 2007. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation, Rochester, USA.
Tutorials on Maximum Entropy Modeling

List of Experiments to be performed :

* To be finalized