IJCNLP 2008

5th November, 2007: The list of accepted papers is being finalized and the notification mails will be sent shortly. The list will also be posted on this site. There might be a slight delay in some cases because some of the reviews are still awaited.

27th October, 2007: Notification date extended to 5th November, 2007

25th September, 2007: New version of the evaluation script made available. This version reports details for individual tags and also calculates precision and other measures based on lenient (lexical) matching. Please check it and let us know of any errors.

20th September, 2007: Deadline for paper submission extended to 25th September (11:59 pm PST).

17th September, 2007: Annotated test data uploaded.

14th September, 2007: A Frequently Asked Questions (FAQ) section started.

14th September, 2007: Test data released for Urdu.

13th September, 2007: Test data released for Hindi, Bengali, Oriya and Telugu.

13th September, 2007: Information about submission of test result data added.

9th September, 2007: Data is now available in ASCII-based notations for all five languages.

30th August, 2007: Training data released for Urdu.

28th August, 2007: Training data released for Telugu.

25th August, 2007: Training data released for Oriya.

23rd August, 2007: Training data released for Bengali.

14th August, 2007: Training data released for Hindi.

10th July, 2007: Workshop site launched.


Important Dates

Release of Training Data: Aug 2 to Aug 25, 2007 (for different languages)
Release of Test Data: Sept 13, 2007
Annotated Test Data Submission Deadline: ~~Sept 15~~ Sept 16, 2007
Paper Submission Deadline: ~~Sept 21, 2007~~ Sept 25, 2007 (11:59 pm PST)
Notification of Paper Acceptance: ~~Oct 26, 2007~~ Nov 5, 2007
Camera Ready Submission Deadline: Nov 16, 2007

Note: There is no separate registration for the shared task (the contest). You will be a contestant if you submit the annotated test data by the deadline mentioned above.

Introduction

Most South and South East Asian languages are scarce in both resources and tools, so it is very important that good systems for Named Entity Recognition (NER) be available: many problems in information extraction and machine translation, among others, depend on accurate NER. However, the issues involved differ significantly from those for European languages. For example, these languages do not have capitalization, which is a major feature for NER systems for European languages. Another characteristic these languages share is that most of them use scripts of Brahmi origin. For some languages there are additional issues such as word segmentation (e.g. for Thai). Large gazetteers are not available for most of these languages. There are also problems of spelling variation and a lack of standardization. In many of these languages, a very large number of frequently used words can also serve as names, unlike in European languages, where a larger proportion of first names are not used as common words. Most importantly, there is a serious lack of labeled data for machine learning.