Workshop on NER for South and South East Asian Languages

January 12, 2008, IIIT, Hyderabad, India

Q.: We are facing some problems with the Oriya and Telugu data. The Telugu data lacks sentence markers. In the test data, each line probably consists of one sentence; if that is true, then the Telugu data has hardly any sentence markers. In the Oriya data, a number of sentences consist of just one character and a sentence marker.

A.: Yes, we are aware of this. However, sentence boundaries should not matter for the NER problem. In the test data, whatever is on one line can be treated as a sentence. There is no need to split sentences: you can simply process the text line by line, whether a line holds a partial sentence or more than one sentence. The problem exists because there is no good way to split sentences automatically, manual splitting would have taken a long time, and the data was prepared under severe time (and other) constraints.
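
A minimal sketch of the line-by-line processing described in this answer. The names `iter_units`, `tag_text`, and `dummy_tagger` are hypothetical, and plain-text input is assumed purely for illustration; the sketch does not parse the SSF format used for the shared task data.

```python
from typing import Callable, Iterator, List


def iter_units(text: str) -> Iterator[str]:
    """Yield each non-empty line as one processing unit, without splitting sentences."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


def tag_text(text: str, tag_line: Callable[[str], List[str]]) -> List[List[str]]:
    """Apply a per-line tagger to every unit, one line at a time."""
    return [tag_line(unit) for unit in iter_units(text)]


def dummy_tagger(line: str) -> List[str]:
    # Placeholder tagger (an assumption): marks every whitespace token as "O",
    # i.e. outside any named entity. A real system would go here.
    return ["O" for _ in line.split()]
```

A line holding a single character, a partial sentence, or several sentences is handled the same way: it is passed to the tagger as one unit.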

Q.: Why do we have to submit results for all five languages?

A.: To provide a somewhat fair ground for comparison, and to see how a method developed for one or more languages works for other languages. The amount of training and testing data differs across languages, as does the quality of the annotated corpus. The requirement is also meant to encourage the development of techniques that work across languages.

Q.: For nested entities, I am still not sure how you are computing the accuracy.

A.: We simply count the nested entities, just as we count the maximal entities. The nested entities are a superset of the maximal entities.
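
One possible reading of this answer, sketched under assumptions: entities are represented as hypothetical `(start, end, label)` token spans with an exclusive `end`, a maximal entity is a span not contained in any wider annotated span, and scoring counts exact matches. This is an illustration only, not the workshop's evaluation script.

```python
from typing import Set, Tuple

# Hypothetical representation: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]


def maximal_entities(spans: Set[Span]) -> Set[Span]:
    """Keep the spans that are not contained in a different, wider span."""
    def contained(inner: Span, outer: Span) -> bool:
        same_bounds = (outer[0], outer[1]) == (inner[0], inner[1])
        return outer[0] <= inner[0] and inner[1] <= outer[1] and not same_bounds

    return {s for s in spans if not any(contained(s, o) for o in spans)}


def precision_recall(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float]:
    """Exact-match counting, applied the same way to either entity set."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

Under these assumptions, the set of all annotated spans plays the role of the nested entities, and `maximal_entities` returns a subset of it, so the same counting function can be applied to both.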

Q.: If I ignore story, sentence and node ids in the data which is in SSF format, will the evaluation script give different results?

A.: Sentence and node ids are not being used in the evaluation script, nor are they relevant for NER.

Q.: Do all the people who submit a system need to write a paper, or will only those who get good results be allowed to write one?

A.: Everyone has to write a paper. The selection will be based on results as well as the paper.

Q.: Can we convert the input data to some other form, process it, and then convert the output back to the SSF format?

A.: Yes, of course. Just make sure that the data is in the correct format and that the number of sentences in each file remains as before; otherwise the evaluation might not be correct.
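
A minimal sketch of a sanity check for such a round trip. How sentences are delimited depends on the format, so the counting function is passed in as a parameter; `count_nonempty_lines` and `same_sentence_count` are hypothetical helpers, not part of any workshop tooling.

```python
from typing import Callable


def count_nonempty_lines(text: str) -> int:
    """Hypothetical counter: treats every non-empty line as one sentence."""
    return sum(1 for line in text.splitlines() if line.strip())


def same_sentence_count(original: str, converted: str,
                        count: Callable[[str], int] = count_nonempty_lines) -> bool:
    """Return True if both texts contain the same number of sentences."""
    return count(original) == count(converted)
```

For SSF files, a format-specific counter would replace the default before comparing each original file with its converted-back counterpart.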

Q.: I have a simple inquiry. Is the Arabic language included in the scope of the workshop? I have already published in the ACL 07 workshop on Semitic languages. As I have a project on Arabic NER and have many issues to publish, I thought it would be a good idea to publish in a specialized workshop, given that the problems mentioned in the CFP also apply to Arabic.

A.: Strictly speaking, Arabic is not part of the contest, as the contest is limited to South and South East Asian (SSEA) languages. However, if you can show that the problems for Arabic are similar to those for SSEA languages, and you also report the results of experiments on at least one SSEA language (the data will be available on the workshop site in some time), then you can submit your paper to this workshop. If the paper has a reasonable SSEA component, it comes within the scope of the workshop. If the paper follows the shared task guidelines, it can be considered for the shared task; otherwise it will go in the regular papers track.

Q.: Can I report results on languages other than SSEA languages, e.g. English, using the CoNLL shared task data?

A.: You can, provided you also report results for SSEA languages. In fact, a comparison of results on English and SSEA languages would be useful.
