Name Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.

Approaches:

Statistical ML methods: HMM, MEMM, CRF
Deep learning methods: RNN-CRF, CNN-CRF

Hidden Markov Model

HMM consists of a discrete-time, discrete-state Markov chain, with hidden states $z_t \in {1,…,K} $
, plus an observation model $p(\mathbf{x}_t|z_t)$.

HMM is a generative model that optimises the likelihood $P(W|T)$, consisting of three components:

Initial probability $\pmb{\pi}$
Transition probability matrix $\pmb{A}$
Emission probability matrix $\pmb{B}$.

The joint distribution is:

$p( \mathbf{z}_{1:T}, \mathbf{x}_{1:T}) =p( \mathbf{z}_{1:T}) p( \mathbf{x}_{1:T}| \mathbf{z}_{1:T}) = \underbrace{p(z_1)}_{\pmb{\pi}} \underbrace{\prod_{t=2}^T p(z_t|z_{t-1})}_{\pmb{A}} \underbrace{[\prod_{t=1}^T p(\mathbf{x}_t|z_t)]}_{\pmb{B}}$

We estimate the posterior by combining the likelihood and the prior P(T).

$\hat{T} = \mathop{\arg\max}_T P(T|W) \\ = \mathop{\arg\max}_T P(W|T) P(T) \\ = \mathop{\arg\max}_T \underbrace{\prod_i P(\textrm{word}_i | \textrm{tag}_i)}_\textrm{emission probability} \underbrace{\prod_i P(\textrm{tag}_i|tag_{i-1})}_\textrm{transmission probability}$

HMM states only conditions on the previous state.

HMM cons: Independence assumptions.

Maximum-Entropy Markov Model (MEMM)

HMM based on the probabilities of transmission probability matrix and emission probability matrix. It is hard to encode the knowledge into these two matrices. If we include many knowledge sources, like capitalisation, the presence of hyphens, word endings. It is not easy to fit the probability like $P(\textrm{capitalisation} | \textrm{tag}), P(\textrm{hyphen} | \textrm{tag}), P(\textrm{suffix} | \textrm{tag}), P(\textrm{suffix} | \textrm{tag})$ into an HMM-style model.

HMM assumes the independence between observations $z$, while MEMM does not. However, MEMM suffers from the label bias problem.

MEMM is a discriminative model. It computes the posterior $P(Z=T|X=W)$ directly. MEMM states conditioned on the previous state and current observation.

$\begin{aligned} P(Z \vert X) &{}= \prod_{t=1}^T p(z_t \vert x_t, z_{t-1}) \\ &{}= \prod_{t=1}^T \frac{1}{Z(x_t, z')} \exp \bigg( \sum_t \lambda_t f(x_t, z_t) \bigg) \end{aligned}$

where $f{\cdot}$ is real-valued feature functions, $Z$ is a normalization term.

Thus,

$\hat{T} = \mathop{\arg\max}_T P(T|W) \\ = \mathop{\arg\max}_T \prod_i P(\textrm{tag}_i | \textrm{word}_i, \textrm{tag}_{i-1})$

Decoding: Viterbi algorithm.

Pros over HMM: offer increased freedom in choosing features to represent obervations.

~~Strict left-to-right word classifier~~.
Cons: only use left sequence information. It cannot consider future sequence information.

MEMM weakness:

Label bias problem: states with low-entropy transition distributions “effectively ignore their observations”. MEMM normalizes over the set of possible output labels at each time step, which is locally normalized. “Conservation of score mass just says that the outgoing scores from a state for a given observation are normalized.”^[7] The result is any inference process will bias towards states with fewer outgoing transitions.

CRFs were designed to overcome this weakness.

vs HMM

unlike the HMM, the MEMM can condition on any useful feature of the input observation. In the HMM this wasn’t possible because the HMM is likelihood-based, hence would have needed to compute the likelihood of each feature of the observation.

$Y$ denotes the state sequence, $X$ denotes the observation.

HMM

$P(Y|X) = \prod_{i=1}^n P(x_i|y_i) \times \prod_{i=1}^n P(y_i|y_{i-1})$

MEMM

$P(Y|X) = \prod_{i=1}^n P(y_i|y_{i-1},x_i)$

To estimate the individual probability of a transition from a state $q’$ to a state $q$ producing an observation $o$, build a MaxEnt model:

$P(y|y',x) = \frac{1}{Z(x,y')} \exp(\sum_i w_i f_i(x,y))$

CRF

Linear-Chain CRF

Modeling the distribution of a set of ouputs $y_{1:T}$ given an input vector $\mathbf{x}$.

$p(y_{1:T}|x, \mathbf{\lambda}) = \frac{1}{Z(\mathbf{x}, \mathbf{\lambda})} \prod_{t=2}^T \phi_t (y_t, y_{t-1}, \mathbf{x}, \mathbf{\lambda})$

where $\lambda$ are the free parameters of the potentials, common form:

$\exp(\sum_{k=1}^K \lambda_k f_{k,t}(y_t,y_{t-1},\mathbf{x}))$

where $f_{k,t}(y_t,y_{t-1}, \mathbf{x})$ are features.

Thus,

$\begin{aligned} p(y_{1:T}|x, \mathbf{\lambda}) &{}= \frac{1}{Z(\mathbf{x}, \mathbf{\lambda})} \prod_{t=2}^T \exp(\sum_{k=1}^K \lambda_k f_{k,t}(y_t,y_{t-1},\mathbf{x})) \\ &{}= \frac{1}{Z(\mathbf{x}, \mathbf{\lambda})} \exp(\sum_{t=2}^T \sum_{k=1}^K \lambda_k f_{k,t}(y_t,y_{t-1},\mathbf{x})) \end{aligned}$

Given a set of input-output sequence pairs, $\mathbf{x}^n, y_{1:T}^n, n=\{1,2,...,N\}$ . We can learn the parameters $\lambda$ by Maximum Likelihood. Under the i.i.d. data assumption, the log likelihood is:

$L(\mathbf{\lambda}) = \sum_{t,n} \sum_{k} \lambda_k f_k (y^n_t, y^n_{t-1}, x^n) - \sum_n \log Z(\mathbf{x}^n, \mathbf{\lambda})$

Cons: heavily rely on engineering features

BiLSTM-CRF

Intuition:
Capture both information from history (forward LSTM) and future (backward LSTM) using bi-LSTM; Also use sentence level tag information (after concatenation of both forward and backward LSTM/GRU hidden states in each time step) followed by the CRF model; Each output after concatenation can be regarded as a sentence level output.^[2] CRF can help learn the boundary constraints, for example, ‘B-‘ is the start of a tag.

Pros: More robust, less affected by the removal of engineering features

Cons: RNNs are not as GPU-efficient as CNNs in terms of training speed and efficiency.

Let $y$ be the tag sequence and $x$ an input word sequence. Then, we have

$P(y \vert x) = \frac{\exp (\textrm{score}(x,y))}{\sum_{y'} \exp(\textrm{score}(x,y'))}$

where

$\begin{aligned} \textrm{score}(x,y) &{}=\sum_{i} \psi_i (x,y) \\ &{}= \sum_i \log \psi_{textrm{emit}} (y_i \rightarrow x_i) + \log \psi_{\textrm{trans}} (y_{i-1} \rightarrow y_i) \\ &{}= \sum_i h_i [y_i] + \mathbf{P}_{y_i, y_i-1} \end{aligned}$

In BiLSTM-CRFs, two potentials: emission and transition. The emission potential for the word $i$ comes from the hidden state of the BiLSTM at timestep $i$. The transition scores are stored at $\mathbf{P} \in \mathbb{R}^{|T| \times |T|}$. In the following implementation^[8], $\mathbf{P}_{y_i, y_i-1}$ is the score of transitioning to tag $i-1$ from tag $i-1$.

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

# util function
def argmax(vec):
    # return the argmax as a python int
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


# Compute log sum exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
        
# BiLSTM CRF
class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Maps the output of the LSTM into tag space.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Matrix of transition parameters.  Entry i,j is the score of
        # transitioning *to* i *from* j.
        # $\mathbf{P}_{y_i, y_i-1}$
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two statements enforce the constraint that we never transfer
        # to the start tag and we never transfer from the stop tag
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        # Do the forward algorithm to compute the partition function
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # Iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(
                    1, -1).expand(1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the
                # scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        # Gives the score of a provided tag sequence
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we dont want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        feats = self._get_lstm_features(sentence) # emission scores
        forward_score = self._forward_alg(feats) # partition function
        gold_score = self._score_sentence(feats, tags) # correct score (numerator) $\exp(\psi(\cdot))$
        return forward_score - gold_score

    def forward(self, sentence):  # dont confuse this with _forward_alg above.
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence) # emission scores

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq

Stack-LSTM

Char-based word representation indicates the word internal information, whilst pretrained word embeddings hold contextual text information.
Then, concat character-based word representation using Bi-LSTM and pretrained word embeddings as the final embeddings. ^[6]

ID-CNN-CRF

Pros: fast, resource-efficient

Dilation convolutions:

context does not need to be consecutive
By stacking layers of dilated convolutions of exponentially dilation width, we can expand the size of the effective input width to cover the entire length of most sequences using only a few layers: the size of the effective input width for a token at layer l is now given by 2^l+1-1
Let x_t denotes the token in timestep t, W_t is a sliding window of width $r$ tokens on either side of each token in the sequence, the conventional convolution output output c_t is:
$c_t= W_c \bigoplus_{k=0}^{r} x_{t} \pm k$ , where $\bigoplus$ is vector concatenation.
Dilated convolution is defined over a wider effective input width by skpping over \delta inputs, where $\delta$ is the dilation width. The dialated convolution is: $c_t= W_c \bigoplus_{k=0}^{r} x_{t} \pm k \delta$
When $\delta$ is 1, dialated convolution is equivalent to a simple convolution.

Model: Stacked dilated CNNs ^[3]

Let the $j_{th}$ dilated Conv layer of dilation width $\delta$ as $D_{\delta}^{(j)}$ , $L_c$ layers output:

$c_t^{(j)} = Relu(D_{2^{L_c-1}}^{(j-1)} c_t^{j-1})$

And add a final dilation-1 layer to the stack:

$c_t^{L_c+1} = Relu(D_1^{(L_c)} c_t^{L_C})$

Finally, apply the simple affine transformation $W_0$ to the final representation for each token $x_t$ :

$h_T^{(L_b)} = W_O b_t^{L_b}$

ID-CNNs generate token-wise representation (corresponding to logits for each token in RNNs) for each token.

In comparison with BiLSTM-CRF

Bi-LSTM or ID-CNNs calculates the logits for each token, whilst CRF layer capture the transmission probability of sequential inputs, i.e. use NNs to predict each token’s label independently, or apply Viterbi inference in a chain structured graphical model.

BERT

Bidirectional Transformer for Language model, with pretraining methods of Masked Language Models (predicting randomly masked 15% tokens of each sentence) and next sentence prediction (to capture information of ajacent sentences).^[5]

Chinese NER

Chinese NER
Word-based approach

Intuitive pipeline: segmentation → NER
Suffer from segmentation error: NE imports OOV errors in segmentation, and incorrectly segmented entity boundaries lead to NER errors.

Char-based approach

Char-based NER outperform word-based approach in Chinese NER.
Cons: Explicit word and word sequence information are not fully exploited.

Solution:

Lattice LSTM: leverage both char sequence and lexicon word information.
FLAT

Char-based LSTM-CRF model

$\mathbf{x}_j^c = \mathbf{e}^c(c_j)$, where $\mathbf{e}^c$ denotes a char embedding lookup table.
Bi-LSTM output: $\mathbf{h_j^c} = [\overrightarrow{\mathbf{h}_j^c} ; \overleftarrow{\mathbf{h}_j^c}]$
CRF model use $\mathbf{h_1^c}, \mathbf{h_2^c}, …,\mathbf{h_m^c}$ for sequence labelling.

Char + bi-char

Concat char embeddings with char-bigram embeddings.

$\mathbf{x_j^c} = [ \mathbf{e}^c(c_j) ; \mathbf{e}^b(c_j, c_{j+1}) ]$

,where $\mathbf{e}^b$ denotes a char bigram lookup table

Char + softword

add segmentation as soft features by concatenating segmentation label embedding to char embeddings

$\mathbf{x}_j^c = [\mathbf{e}^c (c_j) ; \mathbf{e^s}(seg(c_j)) ]$

, where $\mathbf{e^s}$ signals the segmentation label on the char $c_j$ given by a word segmentor with BMES scheme.

Word-based model

The input for each word $w_i$: $\mathbf{x_i^w} = \mathbf{e}^w (w_i)$, where $e^w$ denotes the word embedding lookup table.
Then feed word embeddings into bi-directional LSTMs.

Integrating char-representations

concat char-based word representation $\mathbf{x}_i^c$ (from cahr-LSTMs or char-CNNs) with pretrained word embeddings $\mathbf{e}^w(w_i)$: $\mathbf{x}_i^w = [\mathbf{e}^w(w_i); \mathbf{x}_i^c]$

1.Word + char bi-LSTMs: bi-LSTMs with chars as input in each time step.

$\mathbf{x}_i^c = [\overrightarrow{\mathbf{h}^c_{t(i,len(i))}} ; \overleftarrow{\mathbf{h}^c_{t(i,1)}}]$

2.Word + char-uni-LSTM

3.Word + char-CNNs
Use char sequence of each word to obtain its char representation $\mathbf{x}_i^c$:

$\mathbf{x}_i^c = \max_{t(i,1)\leq j \leq t(i,len(i))} (\mathbf{W}_{CNN}^T \bigl[ \begin{smallmatrix} \mathbf{e}^c (c_{j-\frac{k-1}{2}}) \\ ... \\ \mathbf{e}^c (c_{j+\frac{k-1}{2}}) \end{smallmatrix} \bigr] + \mathbf{b}_{CNN})$

, where $ k=3 $ is the kernel size and $max$ denotes the max-pooling.

Lattice LSTM

Let $s$ denotes input sequence.

Char-level: $s= c_1,c_2,...,c_m$ , where $c_j$ denotes the $j_{th}$ character.
Word-level: $s= w_1,w_2,...,w_n$ , where $w_i$ denotes the $i_{th}$ token in the word sequence.

In comparison with char-based model:

Char embedding: $\mathbf{x}_j^c = \mathbf{e}^c(c_j)$

Basic recurrent LSTM:

$\left[\begin{array}{c} \mathbf{i}^c_j\\ \mathbf{o}^c_j \\ \mathbf{f}^c_j \\ \tilde{c}^c_j \end{array}\right] = \left[\begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tanh \end{array}\right] (\mathbf{W}^{c^T} \left[\begin{array}{c} \mathbf{x}^c_j \\ \mathbf{h}^c_{j-1}\end{array}\right] + \mathbf{b}^c)$ $\mathbf{c}^c_j = \mathbf{f}^c_j \odot \mathbf{c}^c_{j-1} + \mathbf{i}^c_j \odot \tilde{c}^c_{j}$ $\mathbf{h}_j^c = \mathbf{o}_j^c \odot \tanh(\mathbf{c}^c_j)$

where $\mathbf{i}^c_j$, $\mathbf{f}^c_j$, $\mathbf{o}^c_j$ denotes a set of input, forget and output gates, respectively. $\mathbf{c}^c_j$ denotes the char cell vector, $\mathbf{h}^c_j$ denotes the hidden vector on each char $c_j$, $\mathbf{W}^{c^T}$, $\mathbf{b}^c$ are parameters.

The computation of cell vector $\mathbf{c}^c_j$ considers lexicon subsequence $w_{b,e}^d$ in the sentence^[4].

Each subsequence $w_{b,e}^{d}$ is:

$\mathbf{x}_{b,e}^w = \mathbf{e}^w (w_{b,e}^d)$

, where $\mathbf{e}^w$ denotes the word embedding lookup table.

The word cell $\mathbf{c}^w_{b,e}$ represents the recurrent state of $\mathbf{x}^w_{b,e}$ from the beginning of the sentence. The $\mathbf{c}^w_{b,e}$ is:

$\left[\begin{array}{c} \mathbf{i}^w_{b,e} \\ \mathbf{f}^w_{b,e} \\ \tilde{c}^w_{b,e} \end{array}\right] = \left[\begin{array}{c} \sigma \\ \sigma \\\tanh \end{array}\right] (\mathbf{W}^{w^T} \left[\begin{array}{c} \mathbf{x}^w_{b,e} \\ \mathbf{h}^c_b\end{array}\right] + \mathbf{b}^w)$ $\mathbf{c}^w_{b,e} = \mathbf{f}^w_{b,e} \odot \mathbf{c}^c_b + \mathbf{i}^w_{b,e} \odot \tilde{c}^w_{b,e}$

where $\mathbf{i}^w_{b,e}$ , $\mathbf{f}^w_{b,e}$ are input and forget gates. There is no output gate for word cells since labelling is performed only at the char level.

Here $\mathbf{c}^w_{b,e}$ as an additional recurrent path for information flow for each $\mathbf{c}^c_{j}$ . It applies an additional gate $\mathbf{i}^c_{b,e}$ for each subsequence cell $\mathbf{c}^w_{b,e}$ for controlling its contribution into $\mathbf{c}^w_{b,e}$ :

$\mathbf{i}^c_{b,e} = \sigma (\mathbf{W}^{l^T} \left[\begin{array}{c} \mathbf{x}^c_{e} \\ \mathbf{c}^w_{b,e} \end{array}\right] + \mathbf{b}^l)$

Then cell vector $\mathbf{c}^c_j$ is:

$\mathbf{c}^c_j = \mathbf{\alpha}_{j}^c \odot \tilde{\mathbf{c}}_j^c + \Sigma_{b \in\{b'|w^d_{b',j}\} } \mathbf{\alpha}_{b,j}^c \odot \mathbf{c}^w_{b,j}$

where the gate value $\mathbf{i}^c_{b,j}$ , $\mathbf{i}^c_{j}$ are normalised to $\mathbf{\alpha}_{b,j}^c$ and $\mathbf{\alpha}_{j}^c$ by setting the sum to 1:

$\mathbf{\alpha}_{b,j}^c = \frac{\exp(\mathbf{i}^c_{b,j})}{\mathbf{i}^c_{j} + \Sigma_{b' \in {b''|w_{b'',j}^d}} \exp(\mathbf{i}^c_{b',j})}$ $\mathbf{\alpha}_{j}^c = \frac{\exp(\mathbf{i}^c_{j})}{\mathbf{i}^c_{j} + \Sigma_{b' \in {b''|w_{b'',j}^d}} \exp(\mathbf{i}^c_{b',j})}$ $\mathbf{h}_j^c = \mathbf{o}_j^c \odot \tanh(\mathbf{c}^c_j)$

CRF decoding

On top of $\mathbf{h}_1$ , $\mathbf{h}_2$ , …, $\mathbf{h}_{\tau}$ , where $\tau$ is $n$ for char-based and lattice-based models and $m$ for word-based models. The probability of a label sequence $y = l_1, l_2, ... , l_{\tau}$ is:

$P(y|s) = \frac{\exp(\Sigma_i (\mathbf{W}_{CRF}^{l_i} \mathbf{h}_i) +b_{CRF}^{(l_{i-1}, l_i)} )}{\Sigma_{y'} \exp(\Sigma_i (\mathbf{W}_{CRF}^{l'_i} \mathbf{h}_i) +b_{CRF}^{(l'_{i-1}, l'_i)} )}$

where $y’$ denotes an arbitrary label sequence, $\mathbf{W}_{CRF}^{l_i}$ is a model parameter specific to $l_i$ and $b_{CRF}^{(l_{i-1}, l_i)}$ is a bias specific to $l_{i-1}$ and $l_i$ .

Sentence-level log-likelihood loss with L2 regularization:

$L = \sum_{i=1}^N log(P(y_i|s_i)) + \frac{\lambda}{2}||\Theta||^2$

where $\lambda$ is the $L_2$ regularisation parameter and $\Theta$ represents the parameter set.

FLAT

Flat-Lattice Transformer (FLAT)

References

1.Jurafsky, D., & Martin, J. H. (2014). Speech and language processing (Vol. 3). London: Pearson. ↩
2.Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991. ↩
3.Strubell, E., Verga, P., Belanger, D., & McCallum, A. (2017). Fast and accurate entity recognition with iterated dilated convolutions. arXiv preprint arXiv:1702.02098. ↩
4.Zhang, Y., & Yang, J. (2018). Chinese NER Using Lattice LSTM. arXiv preprint arXiv:1805.02023. ↩
5.Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. ↩
6.Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. arXiv preprint arXiv:1603.01360. ↩
7.The Label Bias Problem ↩
8.PyTorch BiLSTM-CRF ↩
9.BiLSTM-CRF Explanation (in Chinese) ↩

The Gradient

Named Entity Recognition Overview