The tagset for the shared task has 12 tags:
- NEP (Person): 'Orhan Pamuk' or 'Mark Twain' or 'Mohandas Karamchand Gandhi' or simply 'Gandhi'
- NED (Designation): 'Chairman' (as in 'Chairman Mao') or just 'The Chair' or 'President' (as in 'President Bush') or 'Baadshaah' (as in 'Baadshaah Akbar')
- NEO (Organization): 'State Government' or 'Microsoft' or 'Al Qaida' or 'The Ministry of Love'
- NEA (Abbreviation): 'IBM' (or I.B.M.) or 'CRF' or 'APJ' or 'KKK' or 'VHP'
- NEB (Brand): 'Fanta' or 'Windows' or 'Linux' or 'Thumbs Up' or 'HP Laserjet 5M'
- NETP (Title-Person): 'Mr.' or 'Shri' or 'Mahatma' or 'Chakravarti' (as in 'Chakravarti Rajagopalachari')
- NETO (Title-Object): 'The Seven Year Itch' or 'American Beauty' or '1984' (as in '1984 by George Orwell') or 'One Hundred Years of Solitude'
- NEL (Location): 'Delhi' or 'New Delhi' or 'Uttar Bhaarat'
- NETI (Time): '10th July', '1968', '5 pm', 'Chaitra ke teesare din'
- NEN (Number): 'Fifty five', '3.14', 'one lakh'
- NEM (Measure): 'five kilos', 'three days', 'seven years'
- NETE (Terms): 'Horticulture', 'Conditional Random Fields', 'Sociolinguistics', 'The Butterfly Effect'
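For readers who want to work with the tags programmatically, the list above can be written down as a simple lookup table. This is only a minimal sketch: the tag codes and class names come from the list, while the dictionary name `TAGSET` and the helper `describe` are hypothetical names chosen for illustration and are not part of any shared-task tooling.

```python
# The twelve tags from the list above, keyed by tag code.
# TAGSET and describe() are illustrative names, not shared-task tooling.
TAGSET = {
    "NEP": "Person",
    "NED": "Designation",
    "NEO": "Organization",
    "NEA": "Abbreviation",
    "NEB": "Brand",
    "NETP": "Title-Person",
    "NETO": "Title-Object",
    "NEL": "Location",
    "NETI": "Time",
    "NEN": "Number",
    "NEM": "Measure",
    "NETE": "Terms",
}


def describe(tag):
    """Return the class name for a tag code, or a fallback for unknown codes."""
    return TAGSET.get(tag, "not a shared-task tag")


print(describe("NETO"))  # class name for the Title-Object tag
print(len(TAGSET))       # number of tags in the tagset
```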
The tagset used for this shared task has more tags than the four used in the CoNLL 2003 shared task on Named Entity Recognition (NER). We opted for these tags because we needed a slightly finer tagset for machine translation (MT): the initial aim was to improve the performance of the MT system.
As annotation progressed, we ran into problems we had not anticipated. Some classes were hard to distinguish in certain contexts, which made the task difficult for annotators and introduced inconsistencies. For example, it was not always clear whether something should be marked as Number or as Measure, and the same was true of Time and Measure. Another difficult class was that of (technical) terms. Is 'agriculture' a term or not? If not (as most people would say), is 'horticulture' a term or not? In fact, Term was the most difficult class to mark.
One option we explored was to merge the confusable classes mentioned above and ignore the Term class. But we already had a relatively large corpus marked up with these classes. If we merged some classes and ignored the Term class (which had very large coverage and is definitely going to be useful for MT), we would be throwing away a lot of information. We also had some corpus data annotated by others that was based on a different tagset, so some problems were inevitable. Finally, we decided to keep the original tagset, with one modification concerning the Title tag. The initial tagset had only eleven tags, and it had one major problem: its single Title tag had two different meanings. 'Mr.' is a Title, but 'The Seven Year Itch' is also a Title. This tag clearly needed to be split into two: Title-Person and Title-Object.
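To make the information loss concrete, the rejected merging option can be sketched on a few of the examples from the tag list. This is a sketch under stated assumptions, not the procedure that was actually evaluated: the merged label `NEQ`, the outside marker `O`, and the function name `coarsen` are all hypothetical, and the choice to collapse Number, Measure and Time into a single class is one possible reading of "merging the confusable classes".

```python
# Hypothetical merge map: the three confusable classes collapse into one
# made-up label, "NEQ", which is not part of the shared-task tagset.
MERGE_MAP = {"NEN": "NEQ", "NEM": "NEQ", "NETI": "NEQ"}
DROPPED = {"NETE"}  # under this option the Term class would be ignored
OUTSIDE = "O"       # assumed marker for "not an entity"


def coarsen(tag):
    """Map a fine-grained tag to its label under the merging option."""
    if tag in DROPPED:
        return OUTSIDE
    return MERGE_MAP.get(tag, tag)


# Toy annotations built from examples in the tag list.
annotated = [
    ("Fifty five", "NEN"),
    ("five kilos", "NEM"),
    ("5 pm", "NETI"),
    ("Horticulture", "NETE"),
    ("Delhi", "NEL"),
]

coarse = [(text, coarsen(tag)) for text, tag in annotated]
distinct_before = len({tag for _, tag in annotated})
distinct_after = len({tag for _, tag in coarse})
print(coarse)
print(distinct_before, distinct_after)
```

In this toy sample, five distinct labels shrink to three, and the Term annotation disappears altogether, which is the kind of loss that argued against merging.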
We should mention here that we considered using another tagset, developed at AUKBC, Chennai, which is based on ENAMEX, TIMEX and NUMEX. It has more than a hundred tags in total. It is meant specifically for MT and only for certain domains (health, tourism). Moreover, it is a tagset for entities in general, not just named entities. But the main reason we decided not to use it was that the existing Hindi and Telugu corpora annotated at IIIT, Hyderabad were based on the tagset described above.