Precision, recall and F-measure will have to be calculated for two cases: maximal named entities and nested named entities. Thus, there will be six measures of performance:
- Maximal Precision: Pm = cm/rm
- Maximal Recall: Rm = cm/tm
- Maximal F-Measure: Fm = 2 × Pm × Rm / (Pm + Rm)
- Nested Precision: Pn = cn/rn
- Nested Recall: Rn = cn/tn
- Nested F-Measure: Fn = 2 × Pn × Rn / (Pn + Rn)
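where c is the number of correctly retrieved (identified) named entities, r is the total number of named entities retrieved by your system (correct plus incorrect), and t is the total number of named entities in the test data. The subscripts m and n indicate that the counts are taken over maximal and nested named entities, respectively.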
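The formulas above can be sketched in a few lines of Python. This is only an illustration, not the official evaluation script: the function name `precision_recall_f` and the counts used are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the task description does not specify.

```python
def precision_recall_f(c, r, t):
    """Return (P, R, F) from the counts defined above.

    c: correctly retrieved named entities
    r: named entities retrieved by the system (correct plus incorrect)
    t: named entities in the test data
    """
    # Assumption: a zero denominator yields 0.0 rather than an error.
    p = c / r if r else 0.0
    rec = c / t if t else 0.0
    f = 2 * p * rec / (p + rec) if (p + rec) else 0.0
    return p, rec, f


# Hypothetical counts, for illustration only.
pm, rm, fm = precision_recall_f(c=8, r=10, t=16)  # maximal named entities
pn, rn, fn = precision_recall_f(c=3, r=6, t=12)   # nested named entities
print(f"Pm={pm:.3f} Rm={rm:.3f} Fm={fm:.3f}")
print(f"Pn={pn:.3f} Rn={rn:.3f} Fn={fn:.3f}")
```

The same function serves both cases; only the counts change between maximal and nested named entities.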
Each of these six measures will be computed for three cases: boundary identification, labelling, and boundary identification plus labelling. Participants will therefore have to report at least eighteen performance values.
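To see where the figure of eighteen comes from, the combinations can be enumerated as in the short sketch below. The label strings and the variable name `report_keys` are chosen here for illustration and are not prescribed by the task.

```python
from itertools import product

# Two entity scopes, three evaluation cases, three measures.
SCOPES = ("maximal", "nested")
CASES = (
    "boundary identification",
    "labelling",
    "boundary identification plus labelling",
)
MEASURES = ("precision", "recall", "F-measure")

# Every (scope, case, measure) combination is one value to report.
report_keys = list(product(SCOPES, CASES, MEASURES))

for scope, case, measure in report_keys:
    print(f"{scope} / {case} / {measure}")

print(len(report_keys), "values in total")
```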
Evaluation will be automatic and will be carried out against the manually prepared test data given to you. An evaluation script for this purpose is available as a zip file and as a tar file. This script assumes that there is a single test file and a single reference file, that the number and order of sentences are the same in both, and that the tokenization (number and order of words) has not been changed by the NER system.
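Before running the evaluation, it may help to confirm that your system output still satisfies these assumptions. The sketch below is not part of the evaluation script. It assumes that both files have already been read into lists of sentences, each sentence being a list of word tokens, and the function name `check_alignment` is hypothetical.

```python
def check_alignment(reference, system):
    """Return None if the two inputs align, else a short description of the problem.

    reference, system: lists of sentences, each sentence a list of word tokens
    (an assumed in-memory representation, not the file format itself).
    """
    if len(reference) != len(system):
        return f"sentence count differs: {len(reference)} vs {len(system)}"
    for i, (ref_sent, sys_sent) in enumerate(zip(reference, system), start=1):
        # Tokens must match in number and order within each sentence.
        if ref_sent != sys_sent:
            return f"tokenization differs in sentence {i}"
    return None


# Hypothetical toy data, for illustration only.
reference = [["w1", "w2", "w3"], ["w4", "w5"]]
aligned = [["w1", "w2", "w3"], ["w4", "w5"]]
retokenized = [["w1", "w2", "w3"], ["w4w5"]]

print(check_alignment(reference, aligned))
print(check_alignment(reference, retokenized))
```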
The format accepted by the evaluation script is the same as given in the