IJCNLP 2008

Workshop on NER for South and South East Asian Languages

January 12, 2008, IIIT, Hyderabad, India


News

2nd December, 2007: Proceedings are now online.

28th November, 2007: Titles of the Invited Talks announced.

24th November, 2007: Workshop Program announced.

21st November, 2007: Speakers for the Invited Talks announced.

18th November, 2007: Information for Asian Fund Application is online at the main IJCNLP site.

18th November, 2007: Registration has been open since 1st November.

17th November, 2007: Information about the Invited Talks at the workshop will be added by 21st November.

16th November, 2007: Information about sending the hard copy of the Copyright Transfer Form added.

9th November, 2007: List of accepted papers posted on the site.

5th November, 2007: The list of accepted papers is being finalized and the notification mails will be sent shortly. The list will also be posted on this site. There might be a slight delay in some cases because some of the reviews are still awaited.

27th October, 2007: Notification date extended to 5th November, 2007.

25th September, 2007: New version of the evaluation script made available. This version gives details about individual tags and also calculates precision etc. based on lenient (lexical) matching. Please check it and report any errors.

20th September, 2007: Deadline for paper submission extended to 25th September (11:59 pm PST).

17th September, 2007: Annotated test data uploaded.

14th September, 2007: A Frequently Asked Questions (FAQ) section started.

14th September, 2007: Test data released for Urdu.

13th September, 2007: Test data released for Hindi, Bengali, Oriya and Telugu.

13th September, 2007: Information about submission of test result data added.

9th September, 2007: Data is now available in ASCII-based notations for all five languages.

30th August, 2007: Training data released for Urdu.

28th August, 2007: Training data released for Telugu.

25th August, 2007: Training data released for Oriya.

23rd August, 2007: Training data released for Bengali.

14th August, 2007: Training data released for Hindi.

10th July, 2007: Workshop site launched.


Important Dates

  • Release of Training Data: Aug 2 to Aug 25, 2007 (for different languages)
  • Release of Test Data: Sept 13, 2007
  • Annotated Test Data Submission Deadline: Sept 16, 2007 (previously Sept 15)
  • Paper Submission Deadline: Sept 25, 2007, 11:59 pm PST (extended from Sept 21)
  • Notification of Paper Acceptance: Nov 5, 2007 (extended from Oct 26)
  • Camera Ready Submission Deadline: Nov 16, 2007

Note: There is no separate registration for the shared task (the contest). You will be a contestant if you submit the annotated test data by the deadline mentioned above.

Introduction

Most South and South East Asian languages are scarce in both resources and tools, so it is very important that good systems for Named Entity Recognition (NER) be available: many problems in information extraction and machine translation (among others) depend on accurate NER. However, the issues involved differ significantly from those for European languages. For example, these languages do not have capitalization, which is a major feature for NER systems for European languages. Another characteristic these languages share is that most of them use scripts of Brahmi origin. For some languages, there are additional issues such as word segmentation (e.g. for Thai). Large gazetteers are not available for most of these languages. There are also the problems of spelling variation and a lack of standardization. The number of frequently used words that can also serve as names is very large for many of these languages, unlike European languages, where a larger proportion of first names are not used as common words. Most importantly, there is a serious lack of labeled data for machine learning.
