IJCNLP 2008

Workshop on NER for South and South East Asian Languages

January 12, 2008, IIIT, Hyderabad, India

Annotation Guidelines

General Guidelines

Note that these guidelines are for manual annotation, not for automatic annotation. The guidelines for automatic annotation, i.e., for the NER system, are described in the Task Description.

  • Specificity: The most important criterion in deciding whether an expression is a named entity is whether it specifies something definite and identifiable, as if by a name. This decision has to be based on the context. For example, 'aanand' (in South Asian languages, where there is no capitalization) is not a named entity in 'saba aanand hii aanand hai' ('There is bliss everywhere'). But it is a named entity in 'aanand kaa yaha aakhiri saala hai' ('Anand is in the last year (of his studies)'). Number, Measure and Term may be seen as exceptions (see below).
  • Maximal Entity: Only the maximal entities have to be annotated manually. The structure of entities will not be annotated by the annotators, even though it has to be learnt by the NER systems. For example, 'One Hundred Years of Solitude' has to be annotated as one entity. 'One Hundred' is not to be marked as a Number here, nor is 'One Hundred Years' to be marked as a Measure in this case. The purpose of this guideline is to make the task of annotation for several languages feasible, given the constraints for South Asian languages. (See Structure Named Entities in the Task Description.)
  • Ambiguity: In cases where the entity can have two valid tags, the more appropriate one is to be used. The annotator has to make the decision in such cases. It is recommended that the annotation be validated by another person or, preferably, that two annotators work on the same data independently and an adjudicator resolve the inconsistencies. Abbreviation is an exception to the Ambiguity guideline (see below).
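The Maximal Entity guideline can be illustrated with a small sketch. It is not part of the annotation tools for the shared task. The span representation (token offsets plus a tag), the function name `maximal_entities`, and the tag assigned to the book title are all assumptions made for this illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, tag)
Span = Tuple[int, int, str]


def maximal_entities(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not nested inside a strictly longer span."""
    kept = []
    for start, end, tag in spans:
        nested = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in spans
        )
        if not nested:
            kept.append((start, end, tag))
    return sorted(kept)


# Tokens: One(0) Hundred(1) Years(2) of(3) Solitude(4)
candidates = [
    (0, 2, "Number"),        # 'One Hundred'
    (0, 3, "Measure"),       # 'One Hundred Years'
    (0, 5, "Title-Object"),  # whole title; this tag is assumed for the example
]

print(maximal_entities(candidates))  # [(0, 5, 'Title-Object')]
```

Only the outermost span survives, which matches what a manual annotator is asked to mark. The nested Number and Measure spans are the structure that an NER system would still have to learn.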

Guidelines for Specific Tags

  • Abbreviations: Every Abbreviation is also some other kind of named entity. For example, APJ is an Abbreviation, but also a Person. IBM is also an Organization. Such ambiguity cannot be resolved from the context because it is due to the (wrong?) assumption that a named entity can have only one tag. However, the annotators were asked to mark APJ, IBM etc. as Abbreviations only. Multiple annotations were not allowed. This is an exception to the third guideline above (Ambiguity).
  • Designation and Title-Person: An entity is a Designation if it is something formal and official with some responsibilities. If it is just something honorary, then it is a Title-Person. For example, 'Event Coordinator' or 'Research Assistant' is a Designation, but 'Chakravarti' or 'Mahatma' are Title-Person entities.
  • Organization and Brand: The distinction between these two has to be made based on the context. For example, 'Pepsi' could mean an Organization, but it is more likely to mean a Brand.
  • Time and Location: Whether something is to be marked as Time or Location is to be decided based on the Specificity guideline and the context.
  • Number, Measure and Term: These three may not be strictly named entities in the way a person name is. However, we have included them because they are different from other words of the language. For problems like machine translation, they can be treated like named entities. For example, a Term is a word which can be directly translated into some language if we have a dictionary of technical terms. Once we know a word is a Term, there is likely to be less ambiguity about the intended sense of the word, unlike for other normal words.
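The interaction between the Ambiguity guideline and the Abbreviation exception can be sketched as a simple precedence rule. This is an illustration only: the function `resolve_tag` and its arguments are hypothetical, and the annotator's contextual judgement is represented as a plain input rather than computed.

```python
def resolve_tag(candidate_tags, annotator_choice):
    """Pick the single tag to record for an entity with several valid tags.

    candidate_tags: set of tags that are valid for the entity.
    annotator_choice: the tag the annotator judges most appropriate in context.
    """
    # Abbreviation exception: an abbreviation is always marked as Abbreviation,
    # whatever other tag would also be valid.
    if "Abbreviation" in candidate_tags:
        return "Abbreviation"
    # Ambiguity guideline: otherwise the annotator's contextual decision stands.
    if annotator_choice not in candidate_tags:
        raise ValueError("annotator_choice must be one of candidate_tags")
    return annotator_choice


print(resolve_tag({"Abbreviation", "Person"}, "Person"))  # Abbreviation
print(resolve_tag({"Organization", "Brand"}, "Brand"))    # Brand
```

Since multiple annotations were not allowed, the sketch always returns exactly one tag per entity.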