Here we briefly introduce some common evaluation metrics for NER tasks, which consider both the extracted entity boundaries and the entity types.
Scenarios that NER systems predict
Exact Match
- 1) Surface string and entity type match (both entity boundary and type are correct)
- 2) The system hypothesizes an entity (predicts an entity that does not exist in the ground truth)
- 3) The system misses an entity (the entity exists in the ground truth but is not predicted by the NER system)
 
Partial Match (Overlapping)
- 4) Wrong entity type (correct entity boundary, but the type disagrees)
- 5) Wrong boundaries (the boundaries overlap)
- 6) Wrong boundaries and wrong entity type
 
Evaluation Metrics
CoNLL-2003: Conference on Computational Natural Language Learning
- Only considers scenarios 1), 2), and 3) above
- Exact match: precision, recall, and F1 measure (see the sketch after this list)
- See Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition for details.
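
A minimal sketch of exact-match scoring under these assumptions: entities are collected over the whole corpus as (type, start, end) tuples, and the function name and data layout are illustrative rather than taken from the CoNLL-2003 evaluation scripts.

```python
def exact_match_prf1(gold_entities, pred_entities):
    """Micro precision/recall/F1 where a prediction counts as correct only
    if both its boundary and its type exactly match a gold entity."""
    gold = set(gold_entities)          # e.g. {("MUSIC_NAME", 3, 7), ...}
    pred = set(pred_entities)
    correct = len(gold & pred)         # scenario 1): exact boundary and type match
    precision = correct / len(pred) if pred else 0.0   # spurious predictions (scenario 2) lower precision
    recall = correct / len(gold) if gold else 0.0      # missed entities (scenario 3) lower recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```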
 
Automatic Content Extraction (ACE)
- Includes a weighting schema
- See the Automatic Content Extraction 2008 Evaluation Plan (ACE08)
- See The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation
 
Message Understanding Conference (MUC)
- Considers both entity boundary and entity type
- Correct (COR): the prediction matches the gold annotation
- Incorrect (INC): the prediction does not match the gold annotation
- Partial (PAR): the predicted entity boundary overlaps with the gold annotation, but they are not the same
- Missing (MIS): a gold annotation boundary is not identified (present in the gold labels, but not in the predictions)
- Spurious (SPU): a predicted entity boundary does not exist in the gold annotation (present in the predictions, but not in the gold labels)
- See MUC-5 EVALUATION METRICS
- Implementation in Python (see also the sketch after this list)
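
As a sketch of how these five counts turn into scores, MUC-style precision and recall give partially matched entities half credit; the helper below is illustrative and not the official MUC scorer.

```python
def muc_score(cor, inc, par, mis, spu):
    """MUC-style precision/recall/F1 from the five category counts.
    Partial matches receive half credit, following the MUC-5 definitions."""
    possible = cor + inc + par + mis   # entities in the gold annotation
    actual = cor + inc + par + spu     # entities the system predicted
    precision = (cor + 0.5 * par) / actual if actual else 0.0
    recall = (cor + 0.5 * par) / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```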
 
SemEval'13
- Strict: exact match (both entity boundary and type are correct)
- Exact boundary matching: the predicted entity boundary is correct, regardless of the entity type
- Partial boundary matching: the entity boundaries overlap, regardless of the entity type
- Type matching: the entity type is correct; some overlap between the system-tagged entity and the gold annotation is required
 
| Scenario | Gold Entity Type | Gold Entity Boundary (Surface String) | Predicted Entity Type | Predicted Entity Boundary (Surface String) | Type | Partial | Exact | Strict |
|---|---|---|---|---|---|---|---|---|
| III | MUSIC_NAME | 告白气球 | | | MIS | MIS | MIS | MIS |
| II | | | MUSIC_NAME | 年轮 | SPU | SPU | SPU | SPU |
| V | MUSIC_NAME | 告白气球 | MUSIC_NAME | 一首告白气球 | COR | PAR | INC | INC |
| IV | MUSIC_NAME | 告白气球 | SINGER | 告白气球 | INC | COR | COR | INC |
| I | MUSIC_NAME | 告白气球 | MUSIC_NAME | 告白气球 | COR | COR | COR | COR |
| VI | MUSIC_NAME | 告白气球 | SINGER | 一首告白气球 | INC | PAR | INC | INC |
Number of gold standard annotations (possible): $POS = COR + INC + PAR + MIS$

Number of predicted entities (actual): $ACT = COR + INC + PAR + SPU$

Exact match (i.e. Strict, Exact): $Precision = \frac{COR}{ACT}$, $Recall = \frac{COR}{POS}$

Partial match (i.e. Partial, Type): $Precision = \frac{COR + 0.5 \times PAR}{ACT}$, $Recall = \frac{COR + 0.5 \times PAR}{POS}$

F-measure: $F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$

| Measure | Type | Partial | Exact | Strict | 
|---|---|---|---|---|
| Correct | 2 | 2 | 2 | 1 | 
| Incorrect | 2 | 0 | 2 | 3 | 
| Partial | 0 | 2 | 0 | 0 | 
| Missed | 1 | 1 | 1 | 1 | 
| Spurious | 1 | 1 | 1 | 1 | 
| Precision | 0.4 | 0.6 | 0.4 | 0.2 | 
| Recall | 0.4 | 0.6 | 0.4 | 0.2 | 
| F1 score | 0.4 | 0.6 | 0.4 | 0.2 | 
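
The sketch below reproduces the counts and scores above from the six example scenarios. The (type, (start, end)) representation, the character offsets chosen for 告白气球 inside 一首告白气球, and the pairing of gold and predicted mentions are illustrative assumptions, not part of the SemEval'13 tooling.

```python
def overlap(a, b):
    """True if two (start, end) spans share at least one character."""
    return max(a[0], b[0]) < min(a[1], b[1])

def classify(gold, pred, scheme):
    """Assign COR/INC/PAR to one gold/predicted pair under a SemEval'13 scheme.
    gold and pred are (entity_type, (start, end)) tuples."""
    (g_type, g_span), (p_type, p_span) = gold, pred
    if scheme == "strict":    # boundary and type must both match exactly
        return "COR" if g_span == p_span and g_type == p_type else "INC"
    if scheme == "exact":     # boundary must match exactly, type is ignored
        return "COR" if g_span == p_span else "INC"
    if scheme == "partial":   # exact boundary is COR, mere overlap gets partial credit
        return "COR" if g_span == p_span else ("PAR" if overlap(g_span, p_span) else "INC")
    if scheme == "type":      # type must match and the spans must overlap
        return "COR" if g_type == p_type and overlap(g_span, p_span) else "INC"

# Scenarios I, IV, V, VI pair a gold mention with a prediction;
# scenario III is a missed gold mention (MIS) and II is a spurious prediction (SPU).
pairs = [
    (("MUSIC_NAME", (0, 4)), ("MUSIC_NAME", (0, 4))),  # I
    (("MUSIC_NAME", (0, 4)), ("SINGER", (0, 4))),      # IV
    (("MUSIC_NAME", (2, 6)), ("MUSIC_NAME", (0, 6))),  # V: gold 告白气球 inside predicted 一首告白气球
    (("MUSIC_NAME", (2, 6)), ("SINGER", (0, 6))),      # VI
]
for scheme in ("type", "partial", "exact", "strict"):
    counts = {"COR": 0, "INC": 0, "PAR": 0, "MIS": 1, "SPU": 1}  # scenarios III and II
    for gold, pred in pairs:
        counts[classify(gold, pred, scheme)] += 1
    possible = counts["COR"] + counts["INC"] + counts["PAR"] + counts["MIS"]
    actual = counts["COR"] + counts["INC"] + counts["PAR"] + counts["SPU"]
    precision = (counts["COR"] + 0.5 * counts["PAR"]) / actual
    recall = (counts["COR"] + 0.5 * counts["PAR"]) / possible
    print(scheme, counts, round(precision, 2), round(recall, 2))
```

Running this prints category counts matching the table above, with precision and recall of 0.4, 0.6, 0.4, and 0.2 for the Type, Partial, Exact, and Strict schemes respectively.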
PyPI library eval4ner installation: `pip install -U eval4ner`
For attribution in academic contexts, please cite this work as:

@misc{chai2021NER-eval,
  author = {Chai, Yekun},
  title = {{Evaluation Metrics of Named Entity Recognition}},
  year = {2021},
  howpublished = {\url{https://cyk1337.github.io/notes/2018/11/21/NLP/NER/NER-Evaluation-Metrics/}},
}
