Here we briefly introduce some common evaluation metrics for NER tasks, which consider both the extracted entity boundaries and the entity types.
Scenarios that NER systems predict
Exact Match
- 1) Surface string and entity type match (both entity boundary and type are correct)
- 2) The system hypothesizes an entity (predicts an entity that does not exist in the ground truth)
- 3) The system misses an entity (the entity exists in the ground truth but is not predicted by the NER system)
 
Partial Match (Overlapping)
- 4) Wrong entity type (correct entity boundary, but the type disagrees)
- 5) Wrong boundaries (the boundaries overlap)
- 6) Wrong boundaries and wrong entity type
 
Evaluation Metrics
CoNLL-2003: Conference on Computational Natural Language Learning
- Only considers scenarios 1), 2), and 3) above
- Exact match: precision, recall, and F1 measure (see the sketch after this list)
- See Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition for details.
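
A minimal sketch of exact-match scoring under these assumptions: entities are collected over the whole corpus as (type, start, end) tuples, and the function name and data layout are illustrative rather than taken from the CoNLL-2003 evaluation scripts.

```python
def exact_match_prf1(gold_entities, pred_entities):
    """Micro precision/recall/F1 where a prediction counts as correct only
    if both its boundary and its type exactly match a gold entity."""
    gold = set(gold_entities)          # e.g. {("MUSIC_NAME", 3, 7), ...}
    pred = set(pred_entities)
    correct = len(gold & pred)         # scenario 1): exact boundary and type match
    precision = correct / len(pred) if pred else 0.0   # spurious predictions (scenario 2) lower precision
    recall = correct / len(gold) if gold else 0.0      # missed entities (scenario 3) lower recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```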
 
Automatic Content Extraction (ACE)
- Includes a weighting schema
- See the Automatic Content Extraction 2008 Evaluation Plan (ACE08)
- See The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation
 
Message Understanding Conference (MUC)
- Considers both entity boundary and entity type
- Correct (COR): the prediction matches the gold annotation
- Incorrect (INC): the prediction does not match the gold annotation
- Partial (PAR): the predicted entity boundary overlaps with the gold annotation, but they are not the same
- Missing (MIS): a gold annotation boundary is not identified (present in the gold labels, but not in the predictions)
- Spurious (SPU): a predicted entity boundary does not exist in the gold annotation (present in the predictions, but not in the gold labels)
- See MUC-5 EVALUATION METRICS
- Implementation in Python (see also the sketch after this list)
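
As a sketch of how these five counts turn into scores, MUC-style precision and recall give partially matched entities half credit; the helper below is illustrative and not the official MUC scorer.

```python
def muc_score(cor, inc, par, mis, spu):
    """MUC-style precision/recall/F1 from the five category counts.
    Partial matches receive half credit, following the MUC-5 definitions."""
    possible = cor + inc + par + mis   # entities in the gold annotation
    actual = cor + inc + par + spu     # entities the system predicted
    precision = (cor + 0.5 * par) / actual if actual else 0.0
    recall = (cor + 0.5 * par) / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```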
 
SemEval'13
- Strict: exact match (both entity boundary and type are correct)
- Exact boundary matching: the predicted entity boundary is correct, regardless of the entity type
- Partial boundary matching: the entity boundaries overlap, regardless of the entity type
- Type matching: the entity type is correct; some overlap between the system-tagged entity and the gold annotation is required
 
| Scenario | Gold Entity Type | Gold Entity Boundary (Surface String) | Predicted Entity Type | Predicted Entity Boundary (Surface String) | Type | Partial | Exact | Strict |
|---|---|---|---|---|---|---|---|---|
| III | MUSIC_NAME | 告白气球 | | | MIS | MIS | MIS | MIS |
| II | | | MUSIC_NAME | 年轮 | SPU | SPU | SPU | SPU |
| V | MUSIC_NAME | 告白气球 | MUSIC_NAME | 一首告白气球 | COR | PAR | INC | INC |
| IV | MUSIC_NAME | 告白气球 | SINGER | 告白气球 | INC | COR | COR | INC |
| I | MUSIC_NAME | 告白气球 | MUSIC_NAME | 告白气球 | COR | COR | COR | COR |
| VI | MUSIC_NAME | 告白气球 | SINGER | 一首告白气球 | INC | PAR | INC | INC |
Number of gold standard annotations (possible): $POS = COR + INC + PAR + MIS$

Number of predicted entities (actual): $ACT = COR + INC + PAR + SPU$

Exact match (i.e. Strict, Exact): $Precision = \frac{COR}{ACT}$, $Recall = \frac{COR}{POS}$

Partial match (i.e. Partial, Type): $Precision = \frac{COR + 0.5 \times PAR}{ACT}$, $Recall = \frac{COR + 0.5 \times PAR}{POS}$

F-measure: $F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$

| Measure | Type | Partial | Exact | Strict | 
|---|---|---|---|---|
| Correct | 2 | 2 | 2 | 1 | 
| Incorrect | 2 | 0 | 2 | 3 | 
| Partial | 0 | 2 | 0 | 0 | 
| Missed | 1 | 1 | 1 | 1 | 
| Spurious | 1 | 1 | 1 | 1 | 
| Precision | 0.4 | 0.6 | 0.4 | 0.2 | 
| Recall | 0.4 | 0.6 | 0.4 | 0.2 | 
| F1 score | 0.4 | 0.6 | 0.4 | 0.2 | 
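
The sketch below reproduces the counts and scores above from the six example scenarios. The (type, (start, end)) representation, the character offsets chosen for 告白气球 inside 一首告白气球, and the pairing of gold and predicted mentions are illustrative assumptions, not part of the SemEval'13 tooling.

```python
def overlap(a, b):
    """True if two (start, end) spans share at least one character."""
    return max(a[0], b[0]) < min(a[1], b[1])

def classify(gold, pred, scheme):
    """Assign COR/INC/PAR to one gold/predicted pair under a SemEval'13 scheme.
    gold and pred are (entity_type, (start, end)) tuples."""
    (g_type, g_span), (p_type, p_span) = gold, pred
    if scheme == "strict":    # boundary and type must both match exactly
        return "COR" if g_span == p_span and g_type == p_type else "INC"
    if scheme == "exact":     # boundary must match exactly, type is ignored
        return "COR" if g_span == p_span else "INC"
    if scheme == "partial":   # exact boundary is COR, mere overlap gets partial credit
        return "COR" if g_span == p_span else ("PAR" if overlap(g_span, p_span) else "INC")
    if scheme == "type":      # type must match and the spans must overlap
        return "COR" if g_type == p_type and overlap(g_span, p_span) else "INC"

# Scenarios I, IV, V, VI pair a gold mention with a prediction;
# scenario III is a missed gold mention (MIS) and II is a spurious prediction (SPU).
pairs = [
    (("MUSIC_NAME", (0, 4)), ("MUSIC_NAME", (0, 4))),  # I
    (("MUSIC_NAME", (0, 4)), ("SINGER", (0, 4))),      # IV
    (("MUSIC_NAME", (2, 6)), ("MUSIC_NAME", (0, 6))),  # V: gold 告白气球 inside predicted 一首告白气球
    (("MUSIC_NAME", (2, 6)), ("SINGER", (0, 6))),      # VI
]
for scheme in ("type", "partial", "exact", "strict"):
    counts = {"COR": 0, "INC": 0, "PAR": 0, "MIS": 1, "SPU": 1}  # scenarios III and II
    for gold, pred in pairs:
        counts[classify(gold, pred, scheme)] += 1
    possible = counts["COR"] + counts["INC"] + counts["PAR"] + counts["MIS"]
    actual = counts["COR"] + counts["INC"] + counts["PAR"] + counts["SPU"]
    precision = (counts["COR"] + 0.5 * counts["PAR"]) / actual
    recall = (counts["COR"] + 0.5 * counts["PAR"]) / possible
    print(scheme, counts, round(precision, 2), round(recall, 2))
```

Running this prints category counts matching the table above, with precision and recall of 0.4, 0.6, 0.4, and 0.2 for the Type, Partial, Exact, and Strict schemes respectively.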
PyPI library eval4ner installation: `pip install -U eval4ner`
For attribution in academic contexts, please cite this work as:

@misc{chai2021NER-eval,
  author = {Chai, Yekun},
  title = {{Evaluation Metrics of Named Entity Recognition}},
  year = {2021},
  howpublished = {\url{https://cyk1337.github.io/notes/2018/11/21/NLP/NER/NER-Evaluation-Metrics/}},
}
