Bilibili [1espresso] NLP course notes: Transformer, BERT, HMM, NER


Author: 吃一口桃酥 · Published 2020-10-27 00:32:56

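Git repositories: link 1, link 2 (includes BERT sentiment analysis)
For study use only; will be removed on request in case of infringement.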

Chinese Natural Language Processing: the Transformer Model (Part 1)

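The transformer is a seq2seq model proposed by Google Brain in the 2017 paper Attention Is All You Need. It has since been widely applied and extended, and BERT is a pretrained language model derived from the transformer.
The transformer is now widely accepted and used. The usual recipe is to pretrain a language model first and then adapt the pretrained model to downstream tasks such as classification, generation, and tagging. The pretrained model matters a great deal: its quality directly affects performance on the downstream tasks.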

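I. Transformer encoder (theory):
0. Intuition for the $transformer$ model, to build a first picture;
1. $positional \ encoding$, i.e. position embedding (or position encoding);
2. $self \ attention \ mechanism$, i.e. the self-attention mechanism and attention-matrix visualization;
3. $Layer \ Normalization$ and residual connections;
4. Overall structure of the $transformer \ encoder$.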

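II. Transformer code walkthrough, corpus preprocessing, BERT pretraining, and an application to sentiment analysis:

III. Sequence-to-sequence models or Named Entity Recognition (to be decided):
This part depends on feedback from the earlier parts.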

0. Intuition for the $transformer$ model, to build a first picture;

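First, the biggest difference between the transformer and an LSTM: an LSTM is trained sequentially, one character after another, and the next character can enter the LSTM cell only after the current one has passed through it. The transformer is trained in parallel, with all characters processed at the same time, which greatly improves computational efficiency. The transformer uses position embeddings ($positional \ encoding$) to capture word order, and self-attention plus fully connected layers to do the computation. All of these are explained in detail below.
The transformer has two main parts, an encoder and a decoder. The encoder maps a natural-language sequence to a hidden layer (the part drawn as a 3x3 grid in step 2 of the figure below), which is a mathematical representation of the sequence. The decoder then maps the hidden layer back to a natural-language sequence, which lets us tackle tasks such as sentiment classification, named entity recognition, semantic relation extraction, summarization, and machine translation. Briefly, the steps in the figure below are: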

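1. Feed the natural-language sequence to the encoder: Why do we work? (为什么要工作);
2. The hidden layer output by the encoder is fed to the decoder;
3. Feed the $<start>$ symbol to the decoder;
4. Obtain the first character, "为";
5. Feed the first character "为" back into the decoder;
6. Obtain the second character, "什";
7. Feed the second character back in as well, and repeat until the decoder outputs $<end>$ (the terminator), at which point the sequence is complete.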

[Figure: encoder-decoder translation example, steps 1 to 7]

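This part covers only the encoder, i.e. how a natural-language sequence is mapped to the mathematical representation in the hidden layer. Once the encoder is understood, the decoder is easy to follow. More importantly, the BERT pretrained model uses only the encoder: a language model is first trained with the encoder and then adapted to all sorts of other tasks.

Diagram of the Transformer block. Note: for ease of reference, the sections below are numbered to match boxes 1, 2, 3, and 4 in the diagram: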

[Figure: Transformer block structure, boxes 1 to 4]

1. $positional \ encoding$, i.e. position embedding (or position encoding);

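Because the transformer has no recurrent iteration, we must give it the position of every character so that it can recognize order in language.
We now define position embedding, i.e. $positional \ encoding$. It has shape $[max \ sequence \ length, \ embedding \ dimension]$; the embedding dimension equals the character-vector dimension, and $max \ sequence \ length$ is a hyperparameter giving the maximum allowed length of a single sentence.
Note that we usually train the transformer at the character level, so no word segmentation is needed. We first initialize the character embeddings with shape $[vocab \ size, \ embedding \ dimension]$, where $vocab \ size$ is the number of characters in the vocabulary and $embedding \ dimension$ is the dimension of the character vectors, i.e. the mathematical representation of each character.
The paper uses $sine$ and $cosine$ functions of the scaled position to give the model position information:
$PE_{(pos,2i)} = sin(pos / 10000^{2i/d_{\text{model}}}) \quad PE_{(pos,2i+1)} = cos(pos / 10000^{2i/d_{\text{model}}})\tag{eq.1}$
Here $pos$ is the position of the character in the sentence, in $[0, \ max \ sequence \ length)$, and $i$ indexes pairs of embedding dimensions, so that $2i$ and $2i+1$ together cover $[0, \ embedding \ dimension)$. The $sin$ and $cos$ formulas form a pair: they apply to the even and odd indices of the $embedding \ dimension$ axis, for example $0, 1$ as one pair and $2, 3$ as another, and are processed with $sin$ and $cos$ respectively, producing different periodic patterns. Along the $embedding \ dimension$ axis the period grows as the index increases, which yields a texture that encodes position. As page 6 of the paper explains, the wavelengths of the position-embedding functions range from $2 \pi$ to $10000 * 2 \pi$. Each position gets a distinct combination of $sin$ and $cos$ values of different periods along the $embedding \ dimension$ axis, giving it a unique positional texture from which the model can learn dependencies between positions and the sequential nature of language.
Below we plot the position embedding. Reading down the plot, the encoding varies with a different period as the $embedding \ dimension$ index grows.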

# Import dependencies
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

def get_positional_encoding(max_seq_len, embed_dim):
    # Build the positional encoding table.
    # embed_dim: dimension of the character embeddings
    # max_seq_len: maximum sequence length
    # Columns 2i and 2i+1 share the exponent 2i / embed_dim (eq.1), hence 2 * (i // 2).
    # Row 0 is kept at all zeros, as in the original code.
    positional_encoding = np.array([
        [pos / np.power(10000, 2 * (i // 2) / embed_dim) for i in range(embed_dim)]
        if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])
    positional_encoding[1:, 0::2] = np.sin(positional_encoding[1:, 0::2])  # dim 2i, even
    positional_encoding[1:, 1::2] = np.cos(positional_encoding[1:, 1::2])  # dim 2i+1, odd
    # Optional normalization: divide each row of the encoding by its L2 norm
    # denominator = np.sqrt(np.sum(positional_encoding**2, axis=1, keepdims=True))
    # positional_encoding = positional_encoding / (denominator + 1e-8)
    return positional_encoding

positional_encoding = get_positional_encoding(max_seq_len=100, embed_dim=16)
plt.figure(figsize=(10,10))
sns.heatmap(positional_encoding)
plt.title("Sinusoidal Function")
plt.xlabel("hidden dimension")
plt.ylabel("sequence length")

Text(69.0, 0.5, 'sequence length')

[Figure: heatmap of the positional encoding, x axis hidden dimension, y axis sequence length]

plt.figure(figsize=(8, 5))
plt.plot(positional_encoding[1:, 1], label="dimension 1")
plt.plot(positional_encoding[1:, 2], label="dimension 2")
plt.plot(positional_encoding[1:, 3], label="dimension 3")
plt.legend()
plt.xlabel("Sequence length")
plt.ylabel("Period of Positional Encoding")

Text(0, 0.5, 'Period of Positional Encoding')

[Figure: positional encoding values along the sequence for dimensions 1, 2, and 3]

X: [batch_size, len, embedding_size]
W: [embedding_size, hidden_dimension]
XW = [batch_size, len, hidden_dimension]  # the embedding_size dimension is contracted away

2. $self \ attention \ mechanism$, the self-attention mechanism;

[Figure: self-attention mechanism]

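Assume the components of Q and K are independent with mean 0 and variance 1. The dot product of a query vector and a key vector then has variance $d_k$, so dividing by $\sqrt{d_k}$ scales the attention scores back to unit variance, which gives better-behaved gradients.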
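The variance claim can be checked numerically. The following is a minimal sketch, assuming a head dimension `d_k = 64` and a sample count chosen only for this demo; neither value comes from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64      # assumed head dimension, for illustration only
n = 20000     # number of random (q, k) pairs to sample

# q and k have independent components with mean 0 and variance 1
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

raw = np.sum(q * k, axis=1)     # unscaled dot products
scaled = raw / np.sqrt(d_k)     # scaled as in attention

raw_var = raw.var()             # close to d_k
scaled_var = scaled.var()       # close to 1
print(raw_var, scaled_var)
```

Without the scaling, the scores entering the softmax would grow with the head dimension and push the softmax toward saturation.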

Attention Mask

[Figure: attention mask]

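Note that attention is normally computed over a $mini \ batch$, i.e. several sentences at once, so $X$ has shape $[batch \ size, \ sequence \ length]$, where $sequence \ length$ is the sentence length. A $mini \ batch$ consists of sentences of different lengths, so the shorter ones are padded to the length of the longest sentence in the $mini \ batch$, usually with $0$. This is called $padding$.
This causes a problem in the $softmax$. Recall the $softmax$ function $\sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}$: $e^0$ is 1, a nonzero value, so the $padding$ positions take part in the $softmax$. Invalid positions would then influence the result, which is a serious hazard. We therefore apply a $mask$ so that these positions are excluded, usually by adding a very large negative bias to them:
$z_{illegal} = z_{illegal} + bias_{illegal}$
$bias_{illegal} \to -\infty$
$e^{z_{illegal}} \to 0$
After this $masking$, the invalid positions are almost exactly $0$ after the $softmax$, so they no longer take part in the computation.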
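The masking step can be sketched in numpy as follows. This is an illustration under assumptions, not the course code: a single attention head, a `pad_mask` convention of 1 for real tokens and 0 for padding, and `-1e9` standing in for the "very large negative bias".

```python
import numpy as np

def masked_attention(Q, K, V, pad_mask):
    """Q, K, V: [batch, seq_len, d_k]; pad_mask: [batch, seq_len], 1 = real token, 0 = padding."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # [batch, seq_len, seq_len]
    bias = (1.0 - pad_mask)[:, None, :] * -1e9              # large negative bias on padded keys
    scores = scores + bias
    scores = scores - scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 4, 8))
# the second sentence has only two real tokens; the last two positions are padding
pad_mask = np.array([[1, 1, 1, 1],
                     [1, 1, 0, 0]], dtype=float)
out, weights = masked_attention(X, X, X, pad_mask)
print(weights[1].round(3))   # the last two columns are 0
```

Each row of `weights` still sums to 1, but the padded keys receive no attention.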

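3. $Layer \ Normalization$ and residual connections.

1). Residual connections:
In the previous step we obtained $V$ weighted by the attention matrix, i.e. $Attention(Q, \ K, \ V)$. We transpose it so that its shape matches $X_{embedding}$, i.e. $[batch \ size, \ sequence \ length, \ embedding \ dimension]$, and then add the two elementwise to form the residual connection, which works because their shapes agree:
$X_{embedding} + Attention(Q, \ K, \ V)$
In all later computation, the input to each module is added to that module's output to form a residual connection. During training this gives gradients a shortcut straight back to the earliest layers:
$X + SubLayer(X) \tag{eq. 5}$
2). $LayerNorm$:
$Layer \ Normalization$ normalizes each hidden-layer vector to zero mean and unit variance, which speeds up training and convergence:
$\mu_{i}=\frac{1}{m} \sum^{m}_{j=1}x_{ij}$
The mean is taken over each row $(row)$ of the matrix;
$\sigma^{2}_{i}=\frac{1}{m} \sum^{m}_{j=1} (x_{ij}-\mu_{i})^{2}$
The variance is taken over each row $(row)$ of the matrix;
$LayerNorm(x)=\alpha \odot \frac{x_{ij}-\mu_{i}} {\sqrt{\sigma^{2}_{i}+\epsilon}} + \beta \tag{eq.6}$
We subtract the row mean from every element of a row and divide by the row's standard deviation to get the normalized values; $\epsilon$ prevents division by $0$;
Two trainable parameters $\alpha, \ \beta$ are then introduced to compensate for information lost in normalization. Note that $\odot$ denotes elementwise multiplication, not a dot product. $\alpha$ is usually initialized to all $1$ and $\beta$ to all $0$. $\alpha$ and $\beta$ have the same size as the last dimension of $x$ (the $embedding \ dimension$) and are broadcast over the other dimensions.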
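As a small numerical check of eq.6, here is a numpy sketch of the residual addition followed by LayerNorm. The tensor sizes and the random `sublayer_out` standing in for the attention output are assumptions made for the demo.

```python
import numpy as np

def layer_norm(x, alpha, beta, eps=1e-12):
    # mean and variance over the last axis, i.e. over each row (one vector per character)
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return alpha * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 16)) * 3.0 + 7.0   # [batch, seq_len, embed_dim]
sublayer_out = rng.standard_normal((2, 5, 16))    # stand-in for Attention(Q, K, V)

alpha = np.ones(16)    # initialized to all 1
beta = np.zeros(16)    # initialized to all 0
y = layer_norm(x + sublayer_out, alpha, beta)     # residual connection, then LayerNorm

print(y.mean(axis=-1).round(6)[0])   # each row has mean ~0
print(y.std(axis=-1).round(6)[0])    # each row has standard deviation ~1
```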

4. Overall structure of the $transformer \ encoder$.

After the three steps above we know the main components of the $transformer$ encoder. Below, the computation of one $transformer \ block$ is summarized in formulas:
1). Character embeddings and positional encoding:
$X = EmbeddingLookup(X) + PositionalEncoding \tag{eq.2}$
$X \in \mathbb{R}^{batch \ size \ * \ seq. \ len. \ * \ embed. \ dim.}$
2). Self-attention:
$Q = Linear(X) = XW_{Q}$
$K = Linear(X) = XW_{K} \tag{eq.3}$
$V = Linear(X) = XW_{V}$
$X_{attention} = SelfAttention(Q, \ K, \ V) \tag{eq.4}$
3). Residual connection and $Layer \ Normalization$
$X_{attention} = X + X_{attention} \tag{eq. 5}$
$X_{attention} = LayerNorm(X_{attention}) \tag{eq. 6}$
4). Next comes part 4 of the $transformer \ block$ diagram, the $FeedForward$ layer: two linear maps with an activation function such as $ReLU$ applied after the first:
$X_{hidden} = Linear(Activate(Linear(X_{attention}))) \tag{eq. 7}$
5). Repeat 3).:
$X_{hidden} = X_{attention} + X_{hidden}$
$X_{hidden} = LayerNorm(X_{hidden})$
$X_{hidden} \in \mathbb{R}^{batch \ size \ * \ seq. \ len. \ * \ embed. \ dim.}$
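The steps of one block can be traced end to end in a few lines of numpy. This is a simplified sketch under assumptions: a single attention head, no biases, no dropout, no padding mask, LayerNorm without the trainable scale and shift, ReLU placed between the two feedforward maps (as in the `BertIntermediate` / `BertOutput` code later in this document), and randomly initialized weights whose names (`W_Q`, `W_1`, ...) are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-12):
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, params):
    # 2) self-attention: linear maps for Q, K, V, then attention-weighted V
    Q = X @ params["W_Q"]
    K = X @ params["W_K"]
    V = X @ params["W_V"]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    X_attention = softmax(scores) @ V
    # 3) residual connection and LayerNorm
    X_attention = layer_norm(X + X_attention)
    # 4) feedforward: linear, ReLU, linear
    X_hidden = np.maximum(0.0, X_attention @ params["W_1"]) @ params["W_2"]
    # 5) residual connection and LayerNorm again
    return layer_norm(X_attention + X_hidden)

rng = np.random.default_rng(0)
embed_dim, ff_dim = 16, 64
shapes = {"W_Q": (embed_dim, embed_dim), "W_K": (embed_dim, embed_dim),
          "W_V": (embed_dim, embed_dim), "W_1": (embed_dim, ff_dim),
          "W_2": (ff_dim, embed_dim)}
params = {name: rng.standard_normal(shape) * 0.1 for name, shape in shapes.items()}

X = rng.standard_normal((2, 5, embed_dim))   # [batch size, seq. len., embed. dim.]
X_out = encoder_block(X, params)
print(X_out.shape)   # same shape as the input, so blocks can be stacked
```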

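Summary:
We have now covered the encoder of the transformer: how it obtains position information for natural language, and how the attention mechanism works. Take sentiment classification as an example. We know that after self-attention every character in a sentence carries information from all the other characters. So could we prepend a blank token to the sentence, let all the information in the sentence gather in that token, and then map it to the classes we want? That is what BERT does, as covered next time.
In BERT pretraining, a special token is added at the start of each sentence and another at the end. Once pretraining is finished, the $hidden \ state$ of the leading special token can be used for classification tasks.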

Chinese Natural Language Processing: the Transformer Model (Part 2), BERT Pretraining in Practice and Applications

II. Transformer code walkthrough, corpus preprocessing, BERT pretraining, and an application to sentiment analysis:

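First, a word on the order of today's material: the BERT code walkthrough comes last and the main content comes first. We use the PyTorch deep learning framework today, although the choice of framework is not important. The code is not the focus of this lesson; the focus is a set of approaches to corpus preprocessing, modeling, and the difficulties that come up in real $NLP$ applications. So why PyTorch?
I have used Tensorflow for much longer than PyTorch, but after switching I find PyTorch far more convenient for NLP. Tensorflow (1.x) uses static graphs, which makes modeling and debugging cumbersome.
Sequence models in particular need many variable scopes and name scopes, i.e. tensor namespaces. A naming slip easily introduces a bug, and some of these bugs raise no error: you notice a wrong result and have to go back and debug. Tensorflow's static graph also cannot be inspected interactively; you have to call sess.run to compute the value you want to see.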

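PyTorch, by contrast, uses dynamic graphs, much like writing numpy, so it is easy to debug. Models are built in an object-oriented way with classes, declaring the operations first and executing them afterwards, which makes bugs in the dataflow graph unlikely.
If you have never used PyTorch, the code part later today will give you a rough tour, mainly of PyTorch's characteristics. For a proper tutorial, the quick-start material in the official documentation (in English) is very good: https://pytorch.org/tutorials/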

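1. A closer look at $positional \ encoding$, visualizing the positional encoding together with the attention matrix;
2. Definition of a language model, and BERT explained;
3. Preparation before training BERT: corpus preprocessing;
4. BERT pretraining and training parameters;
5. Sentiment classification of natural language with the pretrained BERT model;
6. BERT code walkthrough (in a separate video because of its length).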

1. A closer look at $positional \ encoding$, visualizing the positional encoding together with the attention matrix;

# Import dependencies
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from IPython.display import Image 
init_notebook_mode(connected=True)

def get_positional_encoding(max_seq_len, embed_dim):
    # Build the positional encoding table.
    # embed_dim: dimension of the character embeddings
    # max_seq_len: maximum sequence length
    # Columns 2i and 2i+1 share the exponent 2i / embed_dim (eq.1), hence 2 * (i // 2).
    # Row 0 is kept at all zeros, as in the original code.
    positional_encoding = np.array([
        [pos / np.power(10000, 2 * (i // 2) / embed_dim) for i in range(embed_dim)]
        if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])
    positional_encoding[1:, 0::2] = np.sin(positional_encoding[1:, 0::2])  # dim 2i, even
    positional_encoding[1:, 1::2] = np.cos(positional_encoding[1:, 1::2])  # dim 2i+1, odd
    # Optional normalization: divide each row of the encoding by its L2 norm
    # denominator = np.sqrt(np.sum(positional_encoding**2, axis=1, keepdims=True))
    # positional_encoding = positional_encoding / (denominator + 1e-8)
    return positional_encoding

positional_encoding = get_positional_encoding(max_seq_len=100, embed_dim=128)

# 3D visualization
relation_matrix = np.dot(positional_encoding, positional_encoding.T)[1:, 1:]
data = [go.Surface(z=relation_matrix)]
layout = go.Layout(scene={"xaxis": {'title': "sequence length"}, "yaxis": {"title": "sequence length"}})
fig = go.Figure(data=data, layout=layout)
iplot(fig)

[Figure: 3D surface of the positional-encoding relation matrix $PEPE^T$]

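In the figure above, we multiply the positional encoding matrix by its own transpose (a matrix product): with $PE: \ [seq\_len, \ embedding\_dim ]$, we compute $PEPE^T$, whose shape is $[seq\_len, \ seq\_len ]$. The diagonal of the matrix forms a ridge, i.e. its values are large. When a matrix is multiplied by its own transpose, each diagonal entry is the dot product of a row $(row)$ with itself, so the diagonal is the region with the largest values (the red part). For positional encodings this means each position is most strongly correlated with itself. Moving away from the diagonal (the red ridge) on either side, the values fall off gradually, which shows that the farther apart two positions are, the less correlated their encodings. The positional encoding therefore establishes relationships along the time dimension.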
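The ridge on the diagonal can also be verified numerically. The sketch below is an illustration under assumptions: it uses a vectorized `sinusoidal_pe` helper (a hypothetical name) that follows eq.1 directly and, unlike `get_positional_encoding` above, does not zero out row 0. With this pairing, each diagonal entry is a sum of $sin^2 + cos^2$ terms and equals `embed_dim / 2`.

```python
import numpy as np

def sinusoidal_pe(max_seq_len, embed_dim):
    # eq.1: columns 2i and 2i+1 share the frequency 1 / 10000^(2i / embed_dim)
    pos = np.arange(max_seq_len)[:, None]
    i = np.arange(embed_dim // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / embed_dim)
    pe = np.zeros((max_seq_len, embed_dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(100, 128)
relation = pe @ pe.T                 # [seq_len, seq_len]
diag = np.diag(relation)

print(diag[:3])                                # each equals 128 / 2 = 64
print(relation[50, 51] > relation[50, 60])     # a nearer position scores higher
```

Off the diagonal, entry $(p, q)$ is a sum of $cos(\omega_i (p - q))$ terms, so it depends only on the distance between the two positions.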

2. Definition of a language model, and BERT explained;

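What is a language model? It can be expressed with one formula, $P(c_{1},\ldots ,c_{m})$. Suppose we have a sentence and $c_{1}$ to $c_{m}$ are its $m$ characters; the language model gives the probability that this sentence occurs.

For example, in speech recognition the machine hears "wo wang dai san le" and decodes two candidate sentences, "我网袋散了" (my net bag came apart) and "我忘带伞了" (I forgot my umbrella). The language model judges that $P(\text{"我忘带伞了"}) > P(\text{"我网袋散了"})$, so the correct transcription is "我忘带伞了".

BERT stands for Bidirectional Encoder Representations from Transformers. Last lesson we went through the transformer encoder, whose output hidden layer is the mathematical representation of a natural-language sequence. So what does bidirectional mean? Consider the figure below.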

[Figure: BERT structure, with inputs $E_i$ and outputs $T_i$]

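In the figure above, $E_i$ is a single character or word and $T_i$ is the final hidden layer. Recall the attention matrix and attention weighting from Transformer (Part 1): after that operation every character in the sequence carries information from both the characters before it and the characters after it. This is what bidirectional means. Each character in a sentence, after attention weighting, is re-expressed in terms of all the other characters in the sentence, so every character contains information about every part of the sentence.

BERT builds its language model mainly through two pretraining tasks:

BERT language-model task 1: MASKED LM

In BERT, the Masked LM (Masked Language Model) builds the language model and is one of the pretraining tasks. Put simply, we randomly mask or replace some characters or words in a sentence and have the model predict the masked or replaced parts from context. The $Loss$ is then computed only on the masked parts. It is an easy task to understand; in practice it works as follows: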

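1. Randomly select $15\%$ of the $token$s in a sentence and replace them as follows:

with $80\%$ probability, the $token$ is replaced by $[mask]$;
with $10\%$ probability, it is replaced by some other random $token$;
with $10\%$ probability, it is left unchanged.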
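The selection rule can be sketched with the standard library alone. This assumes the 15% / 80% / 10% / 10% split from the BERT paper; the function name `mask_tokens`, the `None` marker for positions excluded from the loss, and the toy sentence are illustrative choices, not the course's preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels[i] is None where no loss is computed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            masked.append(tok)        # not selected: keep the token, no loss here
            labels.append(None)
            continue
        labels.append(tok)            # selected: the model must recover the original
        r = rng.random()
        if r < 0.8:
            masked.append("[mask]")                                   # 80%: mask
        elif r < 0.9:
            masked.append(rng.choice([v for v in vocab if v != tok]))  # 10%: other token
        else:
            masked.append(tok)                                        # 10%: unchanged
    return masked, labels

tokens = list("我的狗很可爱他喜欢玩耍")
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=0)
print(masked)
print(labels)
```

Keeping some selected tokens unchanged or swapping them for random ones means the model cannot rely on seeing `[mask]` to know where a prediction is required.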

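The model is then asked to predict and restore the masked or replaced parts. The final hidden layer output by the model has shape:
$X_{hidden}: [batch\_size, \ seq\_len, \ embedding\_dim]$
We initialize the weights of a projection layer $W_{vocab}$:
$W_{vocab}: [embedding\_dim, \ vocab\_size]$
$W_{vocab}$ maps the hidden dimension to the vocabulary size; we only need the matrix product of $X_{hidden}$ and $W_{vocab}$:
$X_{hidden}W_{vocab}: [batch\_size, \ seq\_len, \ vocab\_size]$
We then apply $softmax$ over the $vocab\_size$ (last) dimension so that the $vocab\_size$ values for each character sum to $1$. The character with the highest probability is the model's prediction, which we compare against the prepared $Label$ to compute the $Loss$ and backpropagate gradients.
Note that the loss is computed only on the positions that were randomly masked or replaced in step 1. The other positions are excluded from the loss; whatever the model outputs there, we do not care.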
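The projection, softmax, and "loss on masked positions only" can be sketched in numpy. The sizes, the random `X_hidden` and `W_vocab`, and the `ignore_index=-1` convention for positions without a label are assumptions made for this illustration.

```python
import numpy as np

def masked_lm_loss(X_hidden, W_vocab, labels, ignore_index=-1):
    """Cross entropy averaged over positions whose label is not ignore_index."""
    logits = X_hidden @ W_vocab                              # [batch, seq_len, vocab_size]
    logits = logits - logits.max(axis=-1, keepdims=True)     # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    keep = labels != ignore_index                            # positions that count
    safe_labels = np.where(keep, labels, 0)                  # dummy index where ignored
    picked = np.take_along_axis(log_probs, safe_labels[..., None], axis=-1)[..., 0]
    return -(picked * keep).sum() / max(keep.sum(), 1)

rng = np.random.default_rng(0)
X_hidden = rng.standard_normal((2, 4, 8))    # [batch_size, seq_len, embedding_dim]
W_vocab = rng.standard_normal((8, 10))       # [embedding_dim, vocab_size]
labels = np.array([[-1, 3, -1, -1],          # only one position per sentence was masked
                   [-1, -1, 7, -1]])
loss = masked_lm_loss(X_hidden, W_vocab, labels)
print(loss)
```

Changing the hidden vector at a position labelled `-1` leaves the loss unchanged in this sketch, which is the sense in which the model's output there does not matter.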

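BERT language-model task 2: Next Sentence Prediction

First we take a pair of sentences that are adjacent in context, and add some special $token$s to these two consecutive sentences:
$[cls]$ first sentence $[sep]$ second sentence $[sep]$
That is, a $[cls]$ at the start, and a $[sep]$ between the two sentences and at the end, as in the figure below: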

[Figure: BERT input with $[cls]$ and $[sep]$ tokens and segment embeddings]

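The two sentences in the figure are $[cls]$ my dog is cute $[sep]$ he likes playing $[sep]$, i.e. $[cls]$ 我的狗很可爱 $[sep]$ 他喜欢玩耍 $[sep]$. We also need pairs in the same format that are not adjacent in context:
$[cls]$ 我的狗很可爱 $[sep]$ 企鹅不擅长飞行 $[sep]$ ("my dog is cute" / "penguins are not good at flying"), where the second sentence does not follow from the first;
In actual training the two cases appear in a $1 : 1$ ratio: half of the time the input pair is adjacent in context and half of the time it is not.
After these steps we also randomly initialize trainable $segment \ embeddings$, shown in the figure above. Their purpose is to let the model tell the two sentences apart: every $token$ of the first sentence gets $0$ and every $token$ of the second sentence gets $1$, so the model can determine where each sentence starts and ends, for example:
$[cls]$ 我的狗很可爱 $[sep]$ 企鹅不擅长飞行 $[sep]$
$\quad \ 0 \ \ 0 \ \ 0 \ \ 0 \ \ 0 \ \ 0 \ \ 0 \ \ 0 \ \ \ 1 \ \ 1 \ \ 1 \ \ 1 \ \ 1 \ \ 1 \ \ 1 \ \ 1$
The $0$s and $1$s above index the $segment \ embeddings$.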
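Building such an input is mostly bookkeeping. A minimal sketch, assuming character-level tokens and a hypothetical helper name `build_pair`:

```python
def build_pair(sentence_a, sentence_b):
    """Return character-level tokens and segment ids for a sentence pair."""
    tokens = ["[cls]"] + list(sentence_a) + ["[sep]"] + list(sentence_b) + ["[sep]"]
    # [cls], the first sentence, and its [sep] get 0; the second sentence and its [sep] get 1
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair("我的狗很可爱", "企鹅不擅长飞行")
print(tokens)
print(segment_ids)
```

The segment ids are later used as indices into the segment embedding table, in the same way that character ids index the character embedding table.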
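Recall from last lesson that attention makes the vector for every character in a sentence absorb information from all characters in that sentence. So in the final hidden layer we only need to take the vector corresponding to the $[cls]$ $token$; it contains information about the whole input, because we expect all the information to gather in that vector:
The final hidden layer output by the model has shape:
$X_{hidden}: [batch\_size, \ seq\_len, \ embedding\_dim]$
We take the vector for the $[cls]$ $token$, which is entry $0$ along the $seq\_len$ dimension:
$cls\_vector = X_{hidden}[:, \ 0, \ :]$
$cls\_vector \in \mathbb{R}^{batch\_size, \ embedding\_dim}$
We then initialize a weight that maps from $embedding\_dim$ to $1$, i.e. logistic regression, and apply the $sigmoid$ function to obtain the binary-classification prediction.
We write $\hat{y}$ for the model's prediction, whose value lies in $(0, \ 1)$:
$\hat{y} = sigmoid(Linear(cls\_vector)) \quad \hat{y} \in (0, \ 1)$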
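In numpy the whole head is a slice, a matrix product, and a sigmoid. The sketch below uses random stand-ins for the encoder output and for the weights `W` and `b`, so the printed probabilities carry no meaning; only the shapes and the value range do.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch_size, seq_len, embedding_dim = 4, 10, 384

X_hidden = rng.standard_normal((batch_size, seq_len, embedding_dim))  # stand-in encoder output
W = rng.standard_normal((embedding_dim, 1)) * 0.02                     # stand-in trainable weight
b = np.zeros(1)

cls_vector = X_hidden[:, 0, :]               # [batch_size, embedding_dim]
y_hat = sigmoid(cls_vector @ W + b)[:, 0]    # [batch_size], each value in (0, 1)
print(cls_vector.shape, y_hat)
```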

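That completes the training method of $BERT$. Simple, isn't it? Next we prepare the data for $BERT$ pretraining.

3. Preparation before training BERT: corpus preprocessing;

For building the dictionary, see the explanation in ./corpus/BERT_preprocessing.ipynb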

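4. BERT pretraining and training parameters;

The BERT paper recommends these parameters for the base model: $(transformer\_block=12, \ embedding\_dimension=768, \ num\_heads=12, \ Total \ Parameters=110M)$, i.e. $110$ million parameters in total. There is also a larger, higher-performing model with over $300$ million parameters. Training and using models this large takes plenty of compute!

From my own testing, and given what I need for my current work on named entity recognition, semantic analysis, relation extraction, and knowledge graphs, these parameters are more than necessary. The BERT we train today uses $(transformer\_block=6, \ embedding\_dimension=384, \ num\_heads=12, \ Total \ Parameters=23M)$, so the model is cut to roughly $20$ million parameters. Even so, training BERT on the Wikipedia corpus takes a week on a single 2080Ti GPU with 11GB of memory.

Note that today's model is modified from the open-source project https://github.com/huggingface/pytorch-transformers. I added many Chinese comments, a preprocessing module, dynamic padding to improve speed (covered in the code walkthrough later), a sentiment analysis module, and so on;
Chinese Wikipedia corpus: https://github.com/brightmart/nlp_chinese_corpus. I only preprocessed it to suit BERT pretraining; the preprocessed corpus can be downloaded from the Baidu Netdisk link in readme.md;

I have uploaded the BERT model pretrained on the Wikipedia corpus to Baidu Netdisk; see readme.md. One caveat: the pretrained model on the netdisk was trained with a few simple tricks that do not appear in the open-source code of this tutorial, for reasons I cannot go into. I can describe the tricks, though, and you can implement them yourself. Also, I do not recommend retraining BERT with my published training code, because the trained model I uploaded performs somewhat better: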

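BERT training tricks:

Because we train BERT at the character level, in the Masked LM we split out the English words in a sentence and mask the whole span of an English word together, letting the model predict that span;
Many sentences contain numbers, and it is clearly unrealistic to expect the model to predict exact numbers in the Masked LM. We therefore replace every number in the text (integers and decimals) with a special token, #NUM#, so the model only has to predict that some number belongs there.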
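The second trick is easy to try with a regular expression. This is a sketch of one possible implementation, not the author's code (which is not published); the pattern below covers plain integers and decimals only, and the names `NUM_PATTERN` and `replace_numbers` are illustrative.

```python
import re

# one or more digits, optionally followed by a decimal point and more digits
NUM_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def replace_numbers(text):
    """Replace every integer or decimal in text with the special token #NUM#."""
    return NUM_PATTERN.sub("#NUM#", text)

print(replace_numbers("房间58块钱, 评分4.5分"))   # 房间#NUM#块钱, 评分#NUM#分
```

In a character-level pipeline, `#NUM#` would then need to be treated as a single token in the dictionary rather than split into characters.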

The BERT training code walkthrough is in part 6

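5. Sentiment classification of natural language with the pretrained BERT model;

Sentiment corpus preprocessing: see ./corpus/sentiment_preprocessing.ipynb. I used a hotel review corpus, somewhat larger than the one used for the 2018 LSTM sentiment analysis, with 5000 positive and 5000 negative reviews. This is still a toy dataset: a model with as many parameters as BERT will overfit badly and generalize poorly, which is the problem we need to solve below;
Recall that in Next Sentence Prediction during BERT training we take the vector for $[cls]$, map it to a single value, and apply the $sigmoid$ function:
$\hat{y} = sigmoid(Linear(cls\_vector)) \quad \hat{y} \in (0, \ 1)$
Dynamic learning rate and early stopping ($early \ stop$):
In the previous step we split the corpus into a training set and a test set. In each $epoch$ we train on the training set. Model performance is measured with $AUC$, which is very convenient for binary classification; for lack of time it is not explained here, so please look it up if you are unfamiliar with it.
After each $epoch$ we evaluate on the test set and record the $AUC$ for that $epoch$. If the $AUC$ has not improved over the previous $epoch$, we lower the learning rate, in practice reducing the current learning rate by $1 / 5$. Training stops once the test-set $AUC$ has not improved for $10$ $epoch$s.
Our initial learning rate is $1e-6$, because we start from the model pretrained on the Wikipedia corpus; this is a downstream task and only needs to fine-tune the pretrained model.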
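The schedule can be written as a small loop that is independent of any framework. The sketch below rests on assumptions: "reducing by 1/5" is read as multiplying the learning rate by 0.8, "no improvement for 10 epochs" is measured against the best AUC so far, `epoch_fn` is a hypothetical callback that trains one epoch and returns the test AUC, and the AUC values are made up to exercise the logic.

```python
def run_schedule(epoch_fn, lr=1e-6, lr_factor=0.8, patience=10, max_epochs=1000):
    """Decay lr when AUC fails to beat the previous epoch; stop after `patience` stale epochs."""
    best_auc = float("-inf")
    prev_auc = float("-inf")
    epochs_without_gain = 0
    log = []
    for epoch in range(max_epochs):
        auc = epoch_fn(epoch, lr)          # train one epoch, then evaluate on the test set
        log.append((epoch, lr, auc))
        if auc <= prev_auc:                # no gain over the previous epoch: lower lr
            lr *= lr_factor
        prev_auc = auc
        if auc > best_auc:
            best_auc = auc
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break                      # early stop
    return best_auc, log

# made-up AUC values, only to exercise the control flow
fake_aucs = [0.80, 0.85, 0.90, 0.90, 0.92] + [0.91] * 20
best_auc, log = run_schedule(lambda epoch, lr: fake_aucs[epoch])
print(best_auc, len(log))
```

In a real run, `epoch_fn` would also save a checkpoint whenever the AUC improves, so that the best model is the one kept after the early stop.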
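Dealing with overfitting:
In practice, with $\hat{y} = sigmoid(Linear(cls\_vector)) \quad \hat{y} \in (0, \ 1)$, the $AUC$ was high on both the training and test sets, but when I fed in hotel reviews picked at random from around the web, generalization turned out to be poor. The training dataset is very small, and even with a train/test split the data as a whole is rather uniform, so the model easily fails on cases it has not seen. To improve generalization I tried a different model structure: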

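As shown in the figure above, I tried $mean \ max \ pool$, a way of turning the hidden-layer sequence into a single vector: take the mean and the $max$ along the $sequence \ length$ dimension, concatenate the two into one vector, then map it to a single value and apply the activation as before. Pseudocode:
$X_{hidden}: [batch\_size, \ seq\_len, \ embedding\_dim]$
$mean\_pooled = mean(X_{hidden}, \ dimension=seq\_len) \quad [batch\_size, \ embedding\_dim]$
$max\_pooled = max(X_{hidden}, \ dimension=seq\_len) \quad [batch\_size, \ embedding\_dim]$
$mean\_max\_pooled = concatenate(mean\_pooled, \ max\_pooled, \ dimension=embedding\_dim ) \quad [batch\_size, \ embedding\_dim * 2]$
Here $mean\_max\_pooled$ is the mathematical representation of the sentence and carries its information. This is in effect a $DOC2VEC$-style method: it turns a sentence into one vector whose dimension is the same however long the sentence is, and the vectors can then be used for tasks such as classification and clustering.
Next we again apply a linear map followed by $sigmoid$:
$\hat{y} = sigmoid(Linear(mean\_max\_pooled)) \quad \hat{y} \in (0, \ 1)$
How should we understand this? The hidden layer is the mathematical representation of a sentence. Its mean and maximum are the average response and the maximum response of that representation to the sentence, and the linear map then reads these responses to produce the model's prediction.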
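The pseudocode translates almost line for line into numpy. In this sketch `X_hidden` and the weight `W` are random stand-ins, and padded positions are not excluded from the pooling, which a real implementation would likely want to handle.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch_size, seq_len, embedding_dim = 4, 12, 384
X_hidden = rng.standard_normal((batch_size, seq_len, embedding_dim))  # stand-in encoder output

mean_pooled = X_hidden.mean(axis=1)     # [batch_size, embedding_dim]
max_pooled = X_hidden.max(axis=1)       # [batch_size, embedding_dim]
mean_max_pooled = np.concatenate([mean_pooled, max_pooled], axis=-1)  # [batch_size, embedding_dim * 2]

W = rng.standard_normal((embedding_dim * 2, 1)) * 0.02   # stand-in trainable weight
y_hat = sigmoid(mean_max_pooled @ W)[:, 0]               # [batch_size]
print(mean_max_pooled.shape, y_hat.shape)
```

The output width depends only on `embedding_dim`, so sentences of any length map to vectors of the same size.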

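We also used $weight \ decay$, which amounts to $L2$ regularization; PyTorch has an interface for it that can be called directly, as we will see shortly. $L2$ regularization keeps parameter values from growing too large in magnitude. Since we have little training data, at inference time the model meets characters, words, or sentence structures it has never seen. If its parameters are large, certain unusual sentences or words can trigger an excessive response and push the final output away from the truth. We would like the model to stay calmer, so we add $L2$ regularization.

In addition, our pretrained BERT has 6 transformer blocks, but for sentiment analysis we used only 3. The later blocks add too many parameters and invite overfitting, so we take the hidden layer after the third transformer block and do the $pooling$ there; the remaining transformer blocks are not used.

On top of that I used $dropout$, set to $0.4$. Because the model has so many parameters, $40\%$ of the units are randomly disabled during training to prevent overfitting.

With these methods the $AUC$ on both the training and test sets reached above $0.95$, and in practical tests the model could distinguish the sentiment polarity of sentences reasonably well.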

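Threshold tuning:
The model outputs a value between 0 and 1. We could treat anything above 0.5 as a positive sample and anything below 0.5 as a negative sample, but 0.5 is often not the best decision boundary. I therefore wrote a script that searches for the best threshold, in ./metrics/__init__.py.
The script defines 99 thresholds from 0.01 to 0.99. Values above the threshold count as positive and values below as negative; it then computes the $F1 \ score$ against the test set and picks the threshold that maximizes the $F1 \ score$. During training the threshold search runs once every $epoch$.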
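The search itself is a short loop. The following is a sketch of the idea rather than the contents of ./metrics/__init__.py, which are not shown here; the helper names and the six toy scores are made up for illustration.

```python
import numpy as np

def f1_score(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def find_best_threshold(y_true, y_score):
    """Try thresholds 0.01 ... 0.99 and return the first one with the highest F1."""
    thresholds = [i / 100 for i in range(1, 100)]
    scores = [f1_score(y_true, (y_score > t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# toy labels and model outputs, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.05, 0.20, 0.35, 0.30, 0.80, 0.95])
best_threshold, best_f1 = find_best_threshold(y_true, y_score)
print(best_threshold, best_f1)
```

On this toy data the best threshold falls below 0.5, because one positive example scores only 0.30.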

import pandas as pd
df = pd.read_pickle("./sentiment_state_dict_mean_max_pool/df_log.pickle")
# Tail of the training log: train_auc and test_auc both exceed 0.95.
# test_auc is in fact higher than train_auc, because dropout is active on the training set
df.tail()

	epoch	train_loss	train_auc	test_loss	test_auc
414	414	0.274383	0.954070	0.283045	0.958663
415	415	0.280170	0.952098	0.283048	0.958663
416	416	0.279494	0.952490	0.283052	0.958663
417	417	0.277347	0.953116	0.283056	0.958663
418	418	0.278657	0.952766	0.283057	0.958663

# Let's plot it
import matplotlib.pyplot as plt
plt.plot(df["train_auc"].tolist(), c="b", label="train_auc")
plt.plot(df["test_auc"].tolist(), c="r", label="test_auc")
plt.xlabel("epochs")
plt.ylabel("AUC")
plt.yticks([i/10 for i in range(11)])
plt.grid()
plt.legend()
plt.show()

[Figure: train_auc and test_auc per epoch]

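Sentiment analysis code walkthrough and live test:
See the video for the code walkthrough. Below we run a test (in the output, 正样本 means positive sample, 负样本 means negative sample, and 输出值 is the output value):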

from Sentiment_Inference import *

model = Sentiment_Analysis(max_seq_len=300, batch_size=2)

./sentiment_state_dict_mean_max_pool/sentiment.model.epoch.418 loaded!

# https://www.booking.com/reviews.zh-cn.html
test_list = [
    "有几次回到酒店房间都没有被整理。两个人入住，只放了一套洗漱用品。",
    "早餐时间询问要咖啡或茶，本来是好事，但每张桌子上没有放“怡口糖”（代糖），又显得没那么周到。房间里卫生间用品补充，有时有点漫不经心个人觉得酒店房间禁烟比较好",
    '十六浦酒店有提供港澳码头的SHUTTLE BUS, 但氹仔没有订了普通房, 可能是会员的关系 UPGRADE到了DELUXE房,风景是绿色的河, 感观一般, 但房间还是不错的, 只是装修有点旧了另外品尝了酒店的自助晚餐, 种类不算多, 味道OK, 酒类也免费任饮, 这个不错最后就是在酒店的娱乐场赢了所有费用, 一切都值得了!',
    '地理位置优越，出门就是步行街，也应该是耶路撒冷的中心地带，去老城走约20分钟。房间很实用，虽然不含早餐，但是楼下周边有很多小超市和餐厅、面包店，所以一切都不是问题。',
    '实在失望！如果果晚唔系送朋友去码头翻香港一定会落酒店大堂投诉佢！太离谱了！我地吃个晚饭消费千几蚊 ，买单个黑色衫叫Annie果个唔知系部长定系经理录左我万几蚊！简直系离晒大谱的 ！咁样的管理层咁大间酒店真的都不敢恭维！',
    '酒店服务太棒了, 服务态度非常好, 房间很干净',
    "服务各方面没有不周到而的地方, 各方面没有没想到的细节",
    "房间设施比较旧，虽然是古典风格，但浴室的浴霸比较不好用。很不满意的是大厅坐下得消费，不人性化，而且糕点和沙拉很难吃，贵而且是用塑料盒子装的，5星级？特别是青团，58块钱4个，感觉放了好几天了，超级难吃。。。把外国朋友吓坏了。。。",
    "南京东路地铁出来就能看到，很方便。酒店大堂和房间布置都有五星级的水准。",
    "服务不及5星，前台非常不专业，入住时会告知你没房要等，不然就加钱升级房间。前台个个冰块脸，对待客人好像仇人一般，带着2岁的小孩前台竟然还要收早餐费。门口穿白衣的大爷是木头人，不会提供任何帮助。入住期间想要多一副牙刷给孩子用，竟然被问为什么。五星设施，一星服务，不会再入住！"
]
model(test_list)

有几次回到酒店房间都没有被整理。两个人入住，只放了一套洗漱用品。
负样本, 输出值0.19
----------
早餐时间询问要咖啡或茶，本来是好事，但每张桌子上没有放“怡口糖”（代糖），又显得没那么周到。房间里卫生间用品补充，有时有点漫不经心个人觉得酒店房间禁烟比较好
正样本, 输出值0.56
----------
十六浦酒店有提供港澳码头的SHUTTLE BUS, 但氹仔没有订了普通房, 可能是会员的关系 UPGRADE到了DELUXE房,风景是绿色的河, 感观一般, 但房间还是不错的, 只是装修有点旧了另外品尝了酒店的自助晚餐, 种类不算多, 味道OK, 酒类也免费任饮, 这个不错最后就是在酒店的娱乐场赢了所有费用, 一切都值得了!
正样本, 输出值0.99
----------
地理位置优越，出门就是步行街，也应该是耶路撒冷的中心地带，去老城走约20分钟。房间很实用，虽然不含早餐，但是楼下周边有很多小超市和餐厅、面包店，所以一切都不是问题。
正样本, 输出值0.96
----------
实在失望！如果果晚唔系送朋友去码头翻香港一定会落酒店大堂投诉佢！太离谱了！我地吃个晚饭消费千几蚊 ，买单个黑色衫叫Annie果个唔知系部长定系经理录左我万几蚊！简直系离晒大谱的 ！咁样的管理层咁大间酒店真的都不敢恭维！
负样本, 输出值0.05
----------
酒店服务太棒了, 服务态度非常好, 房间很干净
正样本, 输出值0.88
----------
服务各方面没有不周到而的地方, 各方面没有没想到的细节
负样本, 输出值0.03
----------
房间设施比较旧，虽然是古典风格，但浴室的浴霸比较不好用。很不满意的是大厅坐下得消费，不人性化，而且糕点和沙拉很难吃，贵而且是用塑料盒子装的，5星级？特别是青团，58块钱4个，感觉放了好几天了，超级难吃。。。把外国朋友吓坏了。。。
负样本, 输出值0.18
----------
南京东路地铁出来就能看到，很方便。酒店大堂和房间布置都有五星级的水准。
正样本, 输出值0.98
----------
服务不及5星，前台非常不专业，入住时会告知你没房要等，不然就加钱升级房间。前台个个冰块脸，对待客人好像仇人一般，带着2岁的小孩前台竟然还要收早餐费。门口穿白衣的大爷是木头人，不会提供任何帮助。入住期间想要多一副牙刷给孩子用，竟然被问为什么。五星设施，一星服务，不会再入住！
负样本, 输出值0.01
----------
    


```python
text = "对于这个亲子房来说，没有浴缸对于比较小的小朋友来说可能会有点不太方便，小的时候不太会站立洗澡的，所以可能需要洗盆浴，我们宝宝4岁了，其实也没有关系，但是之前有自己经历过带6个月宝宝出去玩的，很多店家觉得浴缸浪费空间所以都只有淋浴房。但是自己给宝宝洗澡的时候就非常尴尬…不知道这家是不是可以有租用的。因为我们不是一定需要，也没有做询问。"
model(text)
```

对于这个亲子房来说，没有浴缸对于比较小的小朋友来说可能会有点不太方便，小的时候不太会站立洗澡的，所以可能需要洗盆浴，我们宝宝4岁了，其实也没有关系，但是之前有自己经历过带6个月宝宝出去玩的，很多店家觉得浴缸浪费空间所以都只有淋浴房。但是自己给宝宝洗澡的时候就非常尴尬…不知道这家是不是可以有租用的。因为我们不是一定需要，也没有做询问。
负样本, 输出值0.31
----------

BERT implementation

# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch BERT model."""

from __future__ import absolute_import, division, print_function, unicode_literals

import copy
import math
import sys
from io import open
import torch
from torch import nn
from torch.nn import CrossEntropyLoss


def gelu(x):
    """Implementation of the gelu activation function.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu}


class BertConfig(object):
    """Configuration class to store the configuration of a `BertModel`.
    """
    def __init__(self,
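                 vocab_size, # vocabulary size (number of characters)
                 hidden_size=384, # hidden size, also the character-embedding dimension
                 num_hidden_layers=6, # number of transformer blocks
                 num_attention_heads=12, # number of attention heads
                 intermediate_size=384*4, # width of the linear map in the feedforward layer
                 hidden_act="gelu", # activation function
                 hidden_dropout_prob=0.4, # dropout probability
                 attention_probs_dropout_prob=0.4,
                 max_position_embeddings=512*2,
                 type_vocab_size=256, # used for next sentence prediction;
                 # 256 types are reserved here, although only 0 and 1 are used so far
                 initializer_range=0.02 # standard deviation used to initialize model parameters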
                 ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range

class BertEmbeddings(nn.Module):
    """Embedding layer, see Transformer (Part 1), part 1 of the encoder section.
    Construct the embeddings from word, position and token_type embeddings.
    """
    def __init__(self, config):
        super(BertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        # initialize the embedding matrices
        nn.init.orthogonal_(self.word_embeddings.weight)
        nn.init.orthogonal_(self.token_type_embeddings.weight)

        # normalize each row of the embedding matrices to unit L2 norm
        epsilon = 1e-8
        self.word_embeddings.weight.data = \
            self.word_embeddings.weight.data.div(torch.norm(self.word_embeddings.weight, p=2, dim=1, keepdim=True).data + epsilon)
        self.token_type_embeddings.weight.data = \
            self.token_type_embeddings.weight.data.div(torch.norm(self.token_type_embeddings.weight, p=2, dim=1, keepdim=True).data + epsilon)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, positional_enc, token_type_ids=None):
        """
        :param input_ids: shape [batch_size, sequence_length]
        :param positional_enc: positional encoding [sequence_length, embedding_dimension]
        :param token_type_ids: during BERT training, 0 for the first sentence and 1 for the second
        :return: shape [batch_size, sequence_length, embedding_dimension]
        """
        # look up the character embeddings
        words_embeddings = self.word_embeddings(input_ids)

        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = words_embeddings + positional_enc + token_type_embeddings
        # embeddings: [batch_size, sequence_length, embedding_dimension]
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings


class BertSelfAttention(nn.Module):
    """Self-attention layer, see Transformer (Part 1), part 2 of the encoder section"""
    def __init__(self, config):
        super(BertSelfAttention, self).__init__()
        # check that the embedding dimension is divisible by num_attention_heads
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads))
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size
        # linear maps for Q, K, V
        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        # input x is one of Q, K, V, shape: [batch_size, seq_length, embedding_dim]
        # output after reshape and transpose: [batch_size, num_heads, seq_length, embedding_dim / num_heads]
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask, get_attention_matrices=False):
        # linear maps for Q, K, V
        # Q, K, V have shape [batch_size, seq_length, all_head_size], where all_head_size == embedding_dim
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)
        # split Q, K, V into num_heads parts,
        # reshaping to [batch_size, num_heads, seq_length, embedding_dim / num_heads]
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # attention_scores: [batch_size, num_heads, seq_length, seq_length]
        # divide by the square root of the per-head dimension of K to bring the scores back to unit variance
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask (precomputed for all layers in BertModel forward() function)
        attention_scores = attention_scores + attention_mask
        # attention_mask: [batch_size, 1, 1, seq_length]
        # the elementwise addition broadcasts it to [batch_size, num_heads, seq_length, seq_length]

        # Normalize the attention scores to probabilities (softmax gives the attention matrix).
        attention_probs_ = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs_)

        # weight V by the attention matrix
        context_layer = torch.matmul(attention_probs, value_layer)
        # reshape the weighted V to [batch_size, length, embedding_dimension]
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        # optionally return the attention matrices for visualization
        if get_attention_matrices:
            return context_layer, attention_probs_
        return context_layer, None

class BertLayerNorm(nn.Module):
    """LayerNorm layer, see Transformer (Part 1), part 3 of the encoder section"""
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(BertLayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias


class BertSelfOutput(nn.Module):
    # LayerNorm and residual connection wrapped together, applied to the SelfAttention output
    def __init__(self, config):
        super(BertSelfOutput, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


class BertAttention(nn.Module):
    # multi-head attention wrapped together with LayerNorm and the residual connection
    def __init__(self, config):
        super(BertAttention, self).__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)

    def forward(self, input_tensor, attention_mask, get_attention_matrices=False):
        self_output, attention_matrices = self.self(input_tensor, attention_mask, get_attention_matrices=get_attention_matrices)
        attention_output = self.output(self_output, input_tensor)
        return attention_output, attention_matrices


class BertIntermediate(nn.Module):
    # FeedForward layer and activation wrapped together
    def __init__(self, config):
        super(BertIntermediate, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        self.intermediate_act_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states


class BertOutput(nn.Module):
    # LayerNorm and residual connection wrapped together, applied to the FeedForward output
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


class BertLayer(nn.Module):
    # A single transformer block
    def __init__(self, config):
        super(BertLayer, self).__init__()
        self.attention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(self, hidden_states, attention_mask, get_attention_matrices=False):
        # Attention sub-layer (includes LayerNorm and the residual connection)
        attention_output, attention_matrices = self.attention(hidden_states, attention_mask, get_attention_matrices=get_attention_matrices)
        # Feed-forward sub-layer
        intermediate_output = self.intermediate(attention_output)
        # Output projection with LayerNorm and the residual connection
        layer_output = self.output(intermediate_output, attention_output)
        return layer_output, attention_matrices

class BertEncoder(nn.Module):
    # transformer blocks * N
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        layer = BertLayer(config)
        # Make N independent copies of the transformer block
        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True, get_attention_matrices=False):
        """
        :param output_all_encoded_layers: whether to return the hidden states of every transformer block
        :param get_attention_matrices: whether to return the attention matrices, e.g. for visualization
        """
        all_attention_matrices = []
        all_encoder_layers = []
        for layer_module in self.layer:
            hidden_states, attention_matrices = layer_module(hidden_states, attention_mask, get_attention_matrices=get_attention_matrices)
            if output_all_encoded_layers:
                all_encoder_layers.append(hidden_states)
                all_attention_matrices.append(attention_matrices)
        if not output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
            all_attention_matrices.append(attention_matrices)
        return all_encoder_layers, all_attention_matrices


class BertPooler(nn.Module):
    """The pooler extracts the hidden-state vector that corresponds to the #CLS# token (the first token)."""
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

# Linear projection, activation, LayerNorm
class BertPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super(BertPredictionHeadTransform, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.transform_act_fn = ACT2FN[config.hidden_act]
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states


class BertLMPredictionHead(nn.Module):
    def __init__(self, config, bert_model_embedding_weights):
        super(BertLMPredictionHead, self).__init__()
        # Linear projection, activation, LayerNorm
        self.transform = BertPredictionHeadTransform(config)

        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
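        self.decoder = nn.Linear(bert_model_embedding_weights.size(1),
                                 bert_model_embedding_weights.size(0),
                                 bias=False)
        # The linear layer created above maps the transformer output of shape
        # [batch_size, seq_len, embed_dim] to [batch_size, seq_len, vocab_size],
        # i.e. it projects the last dimension onto the vocabulary size to produce
        # the MaskedLM predictions. Its weight can simply be the embedding matrix
        # (used transposed), although in general one would randomly initialize a
        # new layer; the next line ties it to the embedding weights.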
        self.decoder.weight = bert_model_embedding_weights
        self.bias = nn.Parameter(torch.zeros(bert_model_embedding_weights.size(0)))

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states) + self.bias
        return hidden_states

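# Pre-training heads: produce the Masked LM and Next Sentence predictions from the hidden states
class BertPreTrainingHeads(nn.Module):
    """
    During BERT pre-training, the hidden states are used to output the Masked LM
    prediction and the Next Sentence prediction.
    """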
    def __init__(self, config, bert_model_embedding_weights):
        super(BertPreTrainingHeads, self).__init__()

        # Masked LM head: maps the transformer output [batch_size, seq_len, embed_dim]
        # to [batch_size, seq_len, vocab_size]
        self.predictions = BertLMPredictionHead(config, bert_model_embedding_weights)
        # Next Sentence head: maps pooled_output (the vector for the #CLS# token)
        # to a 2-way classification
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score

# Handles model weight initialization
class BertPreTrainedModel(nn.Module):
    """ An abstract class to handle weights initialization and
        a simple interface for downloading and loading pretrained models.
    """
    def __init__(self, config, *inputs, **kwargs):
        super(BertPreTrainedModel, self).__init__()
        if not isinstance(config, BertConfig):
            raise ValueError(
                "Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
                "To create a model from a Google pretrained model use "
                "`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
                    self.__class__.__name__, self.__class__.__name__
                ))
        self.config = config

    def init_bert_weights(self, module):
        """ Initialize the weights.
        """
        if isinstance(module, nn.Linear):
            # Initialize linear-layer weights from a normal distribution
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, BertLayerNorm):
            # Initialize the LayerNorm scale (alpha) to all ones and shift (beta) to all zeros
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            # Initialize biases to zero
            module.bias.data.zero_()


class BertModel(BertPreTrainedModel):
    """BERT model ("Bidirectional Encoder Representations from Transformers").

    Params:
        config: a BertConfig class instance with the configuration to build a new model

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.

    Outputs: Tuple of (encoded_layers, pooled_output)
        `encoded_layers`: controlled by `output_all_encoded_layers` argument:
            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
                to the last attention block of shape [batch_size, sequence_length, hidden_size],
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated to the first character of the
            input (`CLS`) to train on the Next-Sentence task (see BERT's paper).

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = modeling.BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = modeling.BertModel(config=config)
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, positional_enc, token_type_ids=None, attention_mask=None,
                output_all_encoded_layers=True, get_attention_matrices=False):
        if attention_mask is None:
            # torch.LongTensor
            # attention_mask = torch.ones_like(input_ids)
            attention_mask = (input_ids > 0)
            # attention_mask [batch_size, length]
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
        # this attention mask is more simple than the triangular masking of causal attention
        # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        # Attention mask shape: [batch_size, 1, 1, seq_length]

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
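        # Add a large negative bias to the padded (invalid) positions of the attention scores
        # so that they are effectively 0 after the softmax and do not take part in later computation

        # Embedding layer
        embedding_output = self.embeddings(input_ids, positional_enc, token_type_ids)
        # Output after passing through all of the transformer blocks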
        encoded_layers, all_attention_matrices = self.encoder(embedding_output,
                                                              extended_attention_mask,
                                                              output_all_encoded_layers=output_all_encoded_layers,
                                                              get_attention_matrices=get_attention_matrices)
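        # Optionally return the attention matrices of all layers for visualization
        if get_attention_matrices:
            return all_attention_matrices
        # [-1] is the hidden state produced by the last transformer block
        sequence_output = encoded_layers[-1]
        # pooled_output is the vector derived from the hidden state of the #CLS# token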
        pooled_output = self.pooler(sequence_output)
        if not output_all_encoded_layers:
            encoded_layers = encoded_layers[-1]
        return encoded_layers, pooled_output
    

class BertForPreTraining(BertPreTrainedModel):
    """BERT model with pre-training heads.
    This module comprises the BERT model followed by the two pre-training heads:
        - the masked language modeling head, and
        - the next sentence classification head.

    Params:
        config: a BertConfig class instance with the configuration to build a new model.

    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `masked_lm_labels`: optional masked language modeling labels: torch.LongTensor of shape [batch_size, sequence_length]
            with indices selected in [-1, 0, ..., vocab_size]. All labels set to -1 are ignored (masked), the loss
            is only computed for the labels set in [0, ..., vocab_size]
        `next_sentence_label`: optional next sentence classification loss: torch.LongTensor of shape [batch_size]
            with indices selected in [0, 1].
            0 => next sentence is the continuation, 1 => next sentence is a random sentence.

    Outputs:
        if `masked_lm_labels` and `next_sentence_label` are not `None`:
            Outputs the total_loss which is the sum of the masked language modeling loss and the next
            sentence classification loss.
        if `masked_lm_labels` or `next_sentence_label` is `None`:
            Outputs a tuple comprising
            - the masked language modeling logits of shape [batch_size, sequence_length, vocab_size], and
            - the next sentence classification logits of shape [batch_size, 2].

    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])

    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

    model = BertForPreTraining(config)
    masked_lm_logits_scores, seq_relationship_logits = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertForPreTraining, self).__init__(config)
        self.bert = BertModel(config)
        self.cls = BertPreTrainingHeads(config, self.bert.embeddings.word_embeddings.weight)
        self.apply(self.init_bert_weights)
        self.vocab_size = config.vocab_size
        self.next_loss_func = CrossEntropyLoss()
        self.mlm_loss_func = CrossEntropyLoss(ignore_index=0)

    def compute_loss(self, predictions, labels, num_class=2, ignore_index=-100):
        loss_func = CrossEntropyLoss(ignore_index=ignore_index)
        return loss_func(predictions.view(-1, num_class), labels.view(-1))

    def forward(self, input_ids, positional_enc, token_type_ids=None, attention_mask=None,
                masked_lm_labels=None, next_sentence_label=None):
        sequence_output, pooled_output = self.bert(input_ids, positional_enc, token_type_ids, attention_mask,
                                                   output_all_encoded_layers=False)
        mlm_preds, next_sen_preds = self.cls(sequence_output, pooled_output)
        return mlm_preds, next_sen_preds
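
The additive attention mask built in `BertModel.forward` can be illustrated in isolation. The following is a minimal NumPy sketch, not the code used by the model: the helper name `masked_softmax` and the toy shapes are assumptions made only for this illustration. It shows that adding `-10000.0` to the padded key positions drives their softmax weight to (effectively) zero.

```python
import numpy as np

def masked_softmax(scores, attention_mask):
    """Softmax over the last axis with padded key positions suppressed.

    scores:         [batch, heads, from_len, to_len] raw attention scores
    attention_mask: [batch, to_len], 1 for real tokens and 0 for padding
    (hypothetical helper, written only for this illustration)
    """
    # [batch, to_len] -> [batch, 1, 1, to_len] so it broadcasts over heads and queries
    extended = attention_mask[:, None, None, :].astype(scores.dtype)
    # 0.0 where we attend, -10000.0 where the key is padding
    additive_mask = (1.0 - extended) * -10000.0
    shifted = scores + additive_mask
    # subtract the row maximum for numerical stability
    shifted = shifted - shifted.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# toy input: 1 sentence, 1 head, 2 query positions, 3 key positions
scores = np.zeros((1, 1, 2, 3))
attention_mask = np.array([[1, 1, 0]])  # the third token is padding
weights = masked_softmax(scores, attention_mask)
print(weights[0, 0])
```

With equal raw scores, each query spreads its attention evenly over the two real tokens and gives the padded token a weight of zero.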

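Chinese Natural Language Processing: Hidden Markov Models and Named Entity Recognition (NER)

The problem we tackle today is sequence labeling in natural language processing. At present the mainstream approach is a language model (such as LSTM or BERT) followed by a CRF (conditional random field). Why combine the models this way? I will come back to that shortly. To understand CRFs, I first want to introduce the hidden Markov model (HMM), a kind of probabilistic graphical model. Once you understand the HMM and the Viterbi decoding algorithm (Viterbi algorithm), CRFs become much easier to understand.
For this lesson you do not need any background in probabilistic graphical models; basic probability theory is enough.
First, here is today's outline:
0. An overview of the NER (named entity recognition) problem;
1. What a hidden Markov model (HMM) is;
2. The parameters of the HMM;
3. Solving sequence labeling with the HMM, and the HMM learning algorithm;
4. The Viterbi algorithm (the HMM prediction algorithm).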

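0. Overview of the named entity recognition (NER) problem:

Named entity recognition (NER) means identifying entities with a specific meaning in text, mainly person names, place names, organization names and other proper nouns, and marking the words we want to recognize within the text sequence.
For example, take the text: 济南市成立自由贸易试验区 (Jinan City establishes a pilot free trade zone).
Suppose we want to recognize regions and locations in this text. The items to recognize are:
济南市 (location) and 自由贸易试验区 (location).
The NER dataset we use today has 7 tags in total: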

“B-ORG”: organization or company
“I-ORG”: organization or company
“B-PER”: person name
“I-PER”: person name
“O”: other (not an entity)
“B-LOC”: location
“I-LOC”: location

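The text is processed character by character, and every character must be assigned exactly one of the tags above.
But why does every entity type other than "O" (other) correspond to two tags?
Look closely at the prefixes "B" and "I". "B" stands for begin: the first character of an entity is labeled with the "B" tag, and the characters in the middle or at the end of the entity are labeled with the "I" tag.
For example, "自贸区" is labeled 自(B-LOC)贸(I-LOC)区(I-LOC). All three characters carry a "location" tag, but the first character starts the entity and therefore gets the "B" tag, while the two characters after it get "I" tags.
Note that a "B" tag must not be followed by an "I" tag of a different type. For example, 自(B-PER)贸(I-LOC)区(I-LOC) is an invalid labeling: the entity start is tagged as a person name, so even though the rest is tagged as a location, the labeling of this entity is illegal.
This is exactly why we add a probabilistic graphical model, such as a conditional random field, on top of the language model (for example BERT or LSTM): it constrains the model's output and prevents invalid tag sequences.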
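The tagging rule above can be checked mechanically. Below is a minimal sketch; the function name `is_valid_bio` is hypothetical, and it assumes the strict reading of the scheme in which an "I-X" tag is only allowed directly after "B-X" or "I-X" of the same type (so an "I" tag right after "O" is also rejected).

```python
def is_valid_bio(tags):
    """Return True if every "I-X" tag continues an entity of the same type X.

    Assumption for this sketch: "I-X" must directly follow "B-X" or "I-X".
    """
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # "I-X" may only continue an entity of the same type X
            if prev not in ("B-" + entity_type, "I-" + entity_type):
                return False
        prev = tag
    return True

# the valid labeling of 自贸区 from the text
print(is_valid_bio(["B-LOC", "I-LOC", "I-LOC"]))
# the invalid labeling from the text: B-PER followed by I-LOC
print(is_valid_bio(["B-PER", "I-LOC", "I-LOC"]))
```

A plain per-character classifier has no built-in way to enforce this check, which is the motivation for putting a sequence model such as an HMM or CRF on top.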

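1. What is a hidden Markov model (a.k.a. HMM)?

The HMM is a probabilistic graphical model and a generative model. Loosely speaking, the "BIO" entity tags described above are unobservable hidden states, and the HMM describes the process by which a sequence of hidden states (entity tags) generates the observable states (readable text).
In today's problem, the hidden state sequence is the sequence of entity tags, and the observation sequence is the raw, readable corpus text.
For example:
Hidden state sequence: $B\text{-}LOC \mid I\text{-}LOC \mid I\text{-}LOC$
Observation sequence: $\text{自} \mid \text{贸} \mid \text{区}$
Let the set of all possible observations be the set of all Chinese characters, denoted $V_{obs.}$:
$V_{obs.}=\{v_1, v_2, ... , v_M \}$
Here $v$ denotes a single character in the vocabulary, and $M$ is the number of known characters.
Let the set of all possible hidden states be $Q_{hidden}$, with $N$ hidden states in total; for example, our NER data has only 7 tags:
$Q_{hidden} = \{ q_1, q_2, ... , q_N\}$
Suppose we observe a natural-language text sequence $O$ of $T$ characters, together with the entity tags of this text, that is, the hidden states $I$:
$I=\{i_1, i_2, ... , i_T \} \text{ (hidden states)} \quad O=\{o_1, o_2, ... , o_T \} \text{ (observations)}$
Note that we usually call $t$ a time step; the sequences above have $T$ time steps ($T$ characters).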

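[Image: HMM diagram of the hidden states $i_t$ and the observations $o_t$]

The HMM makes two basic assumptions (very important):

1. The hidden state (entity tag) at time $t$ depends only on the hidden state (entity tag) at the previous time step $t - 1$, and is independent of all other hidden states (such as those at $t-2$ or $t+3$).
In the figure above, for example, the blue region indicates that $i_t$ depends only on $i_{t-1}$ and on nothing outside the blue region. $P(i_{t}|i_{t-1})$ is the probability of the hidden state $i$ transitioning from time $t - 1$ to time $t$; the details are covered below.
2. The observation independence assumption. As noted above, the HMM describes how the hidden state sequence (entity tags) generates the observations (readable text).
The observation independence assumption states that at any time step the observation $o_t$ depends only on the hidden state $i_t$ at that time step, and not on the hidden states at other time steps.
In the figure above, for example, the pink region indicates that $o_{t+1}$ depends only on $i_{t+1}$ and on nothing outside the pink region.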

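2. Parameters of the HMM:

1. HMM transition probabilities:
As mentioned above, $P(i_{t}|i_{t-1})$ is the probability of the hidden state $i$ transitioning from time $t - 1$ to time $t$. Suppose we have $7$ entity tags, so $N = 7$ (note that $N$ is the number of possible entity tag types), i.e. $Q_{hidden} = \{ q_0, q_1, ... , q_6\}$ (tag indices start from $0$). If any tag at time $t - 1$ can change into any tag at time $t$ (including itself), there are $N^2$ possible transitions in total, so we can use an $N \times N$ matrix to hold all the hidden-state transition probabilities.

This is the transition probability matrix (transition matrix). We call it $A$, and $A_{ij}$ is the entry in row $i$, column $j$:
$A_{ij}=P(i_{t+1}= q_j | i_{t} = q_i) \quad q_i, q_j \in Q_{hidden}$
The expression above is the probability that the entity tag is $q_i$ at time $t$ and changes to $q_j$ at time $t + 1$.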
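2. HMM emission probabilities:
As noted earlier, at any time step the observation $o_t$ depends only on the hidden state $i_t$ at that time step, i.e. $P(o_t | i_t)$. This is called the emission probability and describes how a hidden state generates an observation.
Suppose the vocabulary has $M$ characters, $V_{obs.}=\{v_0, v_1, ... , v_{M-1} \}$ (indices start from 0 here, so the last index is $M - 1$ and there are $M$ possible observations). Each entity tag (hidden state) can generate any of $M$ different characters (observations). This can be expressed as an emission probability matrix of size $N \times M$.
[Image: emission probability matrix]

The figure above is the emission probability matrix (emission matrix). We call it $B$, and $B_{jk}$ is the entry in row $j$, column $k$:
$B_{jk}=P(o_{t}= v_k | i_{t} = q_j) \quad q_j \in Q_{hidden} \quad v_k \in V_{obs.}=\{v_0, v_1, ... , v_{M-1} \}$
The expression above is the probability that at time $t$ the entity tag (hidden state) $q_j$ generates the character (observation) $v_k$.
3. HMM initial hidden state probabilities, also called initial probabilities. They are usually denoted $\pi$ (not the circle constant here):
$\pi_{q_i}=P(i_1=q_i) \quad q_i \in Q_{hidden} = \{ q_0, q_1, ... , q_{N-1}\}$
This is the probability that the entity tag of the first character $o_1$ of the sequence is $q_i$, that is, the initial hidden state probability.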
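Under the two HMM assumptions, the three parameters together give the joint probability of a tag sequence and a text: $P(I, O) = \pi_{i_1} B_{i_1 o_1} \prod_{t=2}^{T} A_{i_{t-1} i_t} B_{i_t o_t}$. A minimal sketch of that product follows; the function name `joint_probability` is hypothetical, and the toy values (3 hidden states, 2 observation symbols) are the same numbers as in the worked Viterbi example later in this article.

```python
import numpy as np

# Toy parameters (3 hidden states, 2 observation symbols); every row sums to 1.
pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

def joint_probability(states, observations, pi, A, B):
    """P(I, O) = pi[i_1] * B[i_1, o_1] * prod_t A[i_{t-1}, i_t] * B[i_t, o_t]."""
    prob = pi[states[0]] * B[states[0], observations[0]]
    for t in range(1, len(states)):
        # transition from the previous state, then emission of the current observation
        prob *= A[states[t - 1], states[t]] * B[states[t], observations[t]]
    return prob

# hidden states (q_2, q_2, q_2) generating observations (v_0, v_1, v_0)
p = joint_probability([2, 2, 2], [0, 1, 0], pi, A, B)
print(p)
```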

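3. Solving sequence labeling with the HMM, and the HMM learning algorithm;

We now know the three parameters of the HMM, $A, B, \pi$. Suppose we have already learned these parameters and obtained the model's probabilities. How do we use them to solve the sequence labeling problem?
Suppose we are at time $t$ and observe the character $o_t=v_k$ (that is, $v_k$ is observed at time $t$). Suppose we also know the entity tag at time $t - 1$ (the previous time step), $i_{t-1} = \hat{q}^{t-1}_i$ (that is, the tag at time $t - 1$ is $\hat{q}^{t-1}_i$). All we need to do is enumerate every candidate tag $q_j$ for $i_{t}$ and pick the tag (hidden state) $\hat{q}^{t}_{j}$ that maximizes the expression below:
$\hat{q}_j^{t} = \arg\max_{q_j \in Q_{hidden}} P(i_t = q_j | i_{t-1} = \hat{q}^{t-1}_i) P(o_t=v_k| i_t = q_j)$
In other words, plug every tag available at time $t$ into the expression below and choose the tag that gives the largest value as the label of the current character:
$P(\text{candidate tag} \mid \text{tag at the previous time step}) \, P(\text{observed character} \mid \text{candidate tag})$
Note: this only shows how to find the best tag at a single time step $t$. Repeating this computation at every time step does not guarantee the globally optimal tag sequence. For example, the best tag at time $t$ may be $q_j$, but at the next step the transition probabilities from $q_j$ to some other tags may be low, which lowers the overall probability of the paths through $q_j$; by the next time step the best path may no longer pass through $q_j$ at time $t$. Local optimality at each step therefore does not imply global optimality, which is why we later use the Viterbi algorithm to find the globally optimal tag sequence; it is explained in detail below.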
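The gap between per-step and global optimality can be seen on a small example. The sketch below is illustrative only: `greedy_decode` and `path_probability` are hypothetical helper names, and the toy parameters are the same numbers as in the worked Viterbi example later in this article. It applies the per-step argmax rule and then compares the resulting path with the path $(q_2, q_2, q_2)$.

```python
import numpy as np

pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

def greedy_decode(observations, pi, A, B):
    """Pick the locally best state at every step, given the previous pick."""
    path = [int(np.argmax(pi * B[:, observations[0]]))]
    for obs in observations[1:]:
        # P(state | previous state) * P(observation | state) for every candidate state
        scores = A[path[-1], :] * B[:, obs]
        path.append(int(np.argmax(scores)))
    return path

def path_probability(path, observations, pi, A, B):
    """Joint probability of a full state path and the observations."""
    prob = pi[path[0]] * B[path[0], observations[0]]
    for t in range(1, len(path)):
        prob *= A[path[t - 1], path[t]] * B[path[t], observations[t]]
    return prob

observations = [0, 1, 0]
greedy_path = greedy_decode(observations, pi, A, B)
greedy_prob = path_probability(greedy_path, observations, pi, A, B)
better_prob = path_probability([2, 2, 2], observations, pi, A, B)
print(greedy_path, greedy_prob, better_prob)
```

On these numbers the greedy rule leaves $q_2$ after the first step, and the path it ends up with has a lower joint probability than staying in $q_2$ throughout.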

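Generative vs. discriminative models:
A full treatment of generative and discriminative models is beyond the scope of this article; plenty of material is available online.
As a brief recap, let $x$ be a data point and $y$ its label. Logistic regression is a typical discriminative model: we compute $P(y \mid x)$ and obtain a decision boundary. In an HMM we instead compute $P(x \mid y)$: we evaluate every possible value of $y$, compare the results $P(x|y=y_{i})$, and take the $y$ that gives the largest value as the prediction. Strictly speaking, the quantity being maximized is the joint probability $P(x, y) = P(x \mid y)P(y)$; in the HMM the prior $P(y)$ over tag sequences comes from $\pi$ and $A$.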

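HMM parameter learning (supervised learning):
We are using the HMM for sequence labeling, which is a supervised learning problem: we have text together with its tags, and we want to fit an HMM to these data so that the model can later label new text. The simplest approach is to estimate the parameters directly by maximum likelihood:

1. Estimating the initial hidden state probabilities $\pi$:
$\hat{\pi}_{q_i}=\frac{count(q^{1}_{i})}{count(o_1)}$
This is the number of times the tag $q^{1}_{i}$ occurs at time step $1$, i.e. on the first character of a text, divided by the total number of first characters $o_1$ observed. In $q^{1}_{i}$, the superscript 1 denotes time step 1, the subscript $i$ denotes the $i$-th tag (hidden state), and $count$ is the number of occurrences.
2. Estimating the transition probability matrix $A$:
As mentioned earlier, $A_{ij}$ (row $i$, column $j$ of the transition matrix) is the probability that the tag is $q_i$ at time $t$ and changes to $q_j$ at time $t + 1$. Estimating the transition matrix therefore amounts to a bigram model: group every pair of adjacent tags in all the tag sequences and compute their relative frequencies:
$\hat{A}_{ij}=P(i_{t+1}= q_j | i_{t} = q_i)=\frac{count(q_i \text{ followed by } q_j)}{count(q_i)}$
3. Estimating the emission probability matrix $B$:
Recall that $B_{jk}$ (row $j$, column $k$ of the emission matrix) is the probability that at time $t$ the tag (hidden state) $q_j$ generates the character (observation) $v_k$.
$\hat{B}_{jk}=P(o_{t}= v_k | i_{t} = q_j)=\frac{count(q_j \text{ and } v_k \text{ occurring together})}{count(q_j)}$
At this point we can iterate over the whole corpus and estimate the model parameters $A, B, \pi$ as described above.

Note that, from the computations above, the HMM parameters $(A, B, \pi)$ have the following properties:
$\sum_{i}\pi_{q_i} = 1$
$\sum_{j}A_{ij} = \sum_{j}P(i_{t+1}= q_j | i_{t} = q_i) = 1$
$\sum_{k}B_{jk} = \sum_{k}P(o_{t}= v_k | i_{t} = q_j) =1$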
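The counting estimates and the normalization properties above can be sketched on a tiny hand-made corpus. Everything here is illustrative: the two sentences, the three-tag subset and the function name `estimate_hmm` are assumptions for this sketch, and the small constant `eps` that replaces zero counts is a choice borrowed from the `HMM_NER` implementation shown later, not part of the maximum likelihood formulas themselves.

```python
import numpy as np

def estimate_hmm(corpus, tag2idx, char2idx, eps=1e-8):
    """Count-based estimates of pi, A and B from (text, tags) pairs."""
    n, m = len(tag2idx), len(char2idx)
    pi, A, B = np.zeros(n), np.zeros((n, n)), np.zeros((n, m))
    for text, tags in corpus:
        ids = [tag2idx[tag] for tag in tags]
        pi[ids[0]] += 1                       # tag of the first character
        for prev, curr in zip(ids[:-1], ids[1:]):
            A[prev, curr] += 1                # bigram counts over adjacent tags
        for tag_id, char in zip(ids, text):
            B[tag_id, char2idx[char]] += 1    # tag/character co-occurrence counts
    # replace zero counts by eps so that no probability is exactly 0
    pi[pi == 0] = eps
    A[A == 0] = eps
    B[B == 0] = eps
    # normalize counts into probabilities
    return (pi / pi.sum(),
            A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))

# hypothetical two-sentence corpus
corpus = [("济南市", ["B-LOC", "I-LOC", "I-LOC"]),
          ("在济南", ["O", "B-LOC", "I-LOC"])]
tag2idx = {"O": 0, "B-LOC": 1, "I-LOC": 2}
chars = sorted({char for text, _ in corpus for char in text})
char2idx = {char: i for i, char in enumerate(chars)}

pi, A, B = estimate_hmm(corpus, tag2idx, char2idx)
print(pi.sum(), A.sum(axis=1), B.sum(axis=1))
```

In this toy corpus "B-LOC" is always followed by "I-LOC", so the corresponding entry of $A$ comes out very close to 1.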

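4. The Viterbi algorithm (the HMM prediction algorithm).

The Viterbi algorithm uses dynamic programming to solve the prediction problem for models such as HMMs and CRFs. It finds the path with the highest probability, i.e. the optimal path. In today's sequence labeling problem, we use it to find the optimal entity tag sequence for a given text.
If we had to summarize the Viterbi algorithm in one sentence, it would be:
at every time step, compute the maximum probability of ending in each hidden state and record which hidden state at the previous time step that maximum came from; finally, backtrack from the end along the recorded maxima to obtain the most likely, i.e. optimal, path. This sentence will make little sense if you have not seen the algorithm before, but rereading it after working through this section should help it stick.
To make the Viterbi algorithm easier to follow, we adjust the notation slightly:

1. $A_{i, j}^{t-1, t}$ is the entry in row $i$, column $j$ of the transition matrix $A$: the probability that the tag is $q_i$ at time $t - 1$ and changes to $q_j$ at time $t$.
2. $B_{jk}$ is the entry in row $j$, column $k$ of the emission matrix: the probability that at time $t$ the hidden state $q_j$ generates the observation $v_k$.
3. With these two, $\hat{q}_j = A_{ij}B_{jk}$ is the probability estimate for the hidden state being $q_j$ at time $t$.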

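Here we explain the computation of the Viterbi algorithm directly through an example (note that all indices start from $0$ here):

1. Suppose the set of all possible observations is $V_{obs.}=\{v_0, v_1\}$;
2. the set of all possible hidden states is $Q_{hidden}=\{q_0, q_1, q_2\}$;
3. the observed sequence is $O=(o_0=v_0, \ o_1=v_1, \ o_2 = v_0)$;
4. and suppose we have modeled and trained an HMM and obtained the estimated parameters $(A, B, \pi)$, where each row of $A$ and $B$ sums to $1$. The values, which are the ones defined in the NumPy code further below, are $\pi = (0.2, 0.4, 0.4)$, $A = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$ and $B = \begin{pmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \\ 0.7 & 0.3 \end{pmatrix}$;
5. we want the most likely hidden state sequence $I=(i_0, i_1, i_2)$ for the observations $O$.
We now initialize two tables to hold the intermediate results at each time step; how they are used is explained shortly. The tables T1 and T2 both have shape $num\_hidden\_states \times sequence\_length$. At each time step $t$ each table has $3$ cells, where $3$ is the number of possible hidden states, i.e. $|Q_{hidden}|=3$. The colors filling the tables in the figures carry no meaning; they are purely decorative.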

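Computation:

1. First we have the initial hidden state probabilities $\pi$ and the observation at the first time step (time step $0$), $o_0=v_0$. At this time step, the probability of each hidden state generating the observation can be written as $q_j^{t=0} = \pi_{j}B_{jk}$ with $k = 0$, which gives $(0.2 \cdot 0.5, \ 0.4 \cdot 0.4, \ 0.4 \cdot 0.7) = (0.10, \ 0.16, \ 0.28)$.

We now explain what the $T1, T2$ tables are for. If $T1$ and $T2$ are $i \times j$ matrices, column $j$ corresponds to time step $j$ and row $i$ to hidden state $i$. $T1[i, \ j]$ is the maximum probability of being in hidden state $i$ at time step $j$ (what "maximum" means will become clear at the next time step), and $T2[i, \ j]$ records which hidden state at time step $j - 1$ (the previous time step) this maximum came from. In other words, $T2$ records the transition path of the maximum probability.
We now fill the results of the first time step into $T1, T2$. Note that at time step $0$ the hidden state comes from the initial hidden state probabilities rather than from a transition out of a previous time step, so in $T2$ we simply record $NaN$ (not a number).
[Image: the T1 and T2 tables after the first time step]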

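2. We now move to time step $1$ (time steps are indexed from $0$). First we compute $T1[i = 0, j = 1]$, the maximum probability of being in hidden state $q_0$ at time step $j = 1$. To have the hidden state $i_1=q_0$ at the current time step, there are 3 possible paths from the previous time step: $(i_0=q_0, i_1=q_0), \ (i_0=q_1, i_1=q_0), \ (i_0=q_2, i_1=q_0)$.
At position $T1[i = 0, j = 1]$ we want the most likely of these three paths, i.e. the one with the highest probability, so the formula is:
$T1[0, 1]=\max_{i} \left( T1[q_i, time\_step=0] \cdot A_{t-1=q_i, \ t=q_0} \cdot B_{i_1 = q_0, o_1=v_1} \right)$
Here $T1[q_i, time\_step=0]$, i.e. $T1[:, \ 0]$, holds the probability of each hidden state at time $t - 1$ (the previous time step); it is a vector of length $3$;
$A_{t-1=q_i, \ t=q_0}$ is row $i$, column $0$ of $A$: the probability that the hidden state is $q_i$ at time $t - 1$ and $q_0$ at time $t$. There are three possible paths, so this is also a vector of length $3$;
$B_{i_1 = q_0, o_1=v_1}$ is row $0$, column $1$ of $B$: the probability that hidden state $q_0$ generates the observation $v_1$. It is a single number.
Looking up the values, we get:
$T1[0,1]=\max\{0.10 \cdot 0.5 \cdot 0.5, \ 0.16 \cdot 0.3 \cdot 0.5, \ 0.28 \cdot 0.2 \cdot 0.5\}=0.028$
As mentioned, we also need to know which hidden state $i$ at the previous time step this maximum came from, so we record the transition path in $T2[0, 1]$:
$T2[0,1]=\arg\max\{0.10 \cdot 0.5 \cdot 0.5, \ 0.16 \cdot 0.3 \cdot 0.5, \ 0.28 \cdot 0.2 \cdot 0.5\}=2$
We fill the results into the tables. In the accompanying figure, the red line marks the transition with the highest probability, which comes from $q_2$ at the previous time step.
We fill in the rest of the tables in the same way. Next we explain how the Viterbi algorithm uses these stored probabilities and paths to find the optimal path: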

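The optimal path has the following property: suppose the optimal path passes through hidden state $i_t$ at time $t$. Then the part of this path from $i_t$ to its end point $i_T$ must itself be optimal among all possible paths over that stretch. Otherwise there would be a better path from $i_t$ to $i_T$, and joining it to the path from $i_1$ to $i_t$ (the part of the optimal path before $i_t$) would give a path better than the optimal one, which is a contradiction.
Using this property, we just run the steps above until we obtain the hidden state with the maximum probability at the last step, look up which hidden state at the previous step that maximum came from, and keep backtracking through $T2$ until we reach the first time step (where the entry is $NaN$). This recovers the optimal path.
Backtracking:

1. First find the hidden state with the maximum probability at the last step, i.e. take the $argmax$ over the third column of $T1$:
$i_2 = argmax \ T1[:, \ time\_step = 2] = 2$
2. Then trace back one step through $T2$ to find which hidden state at the previous step the current maximum came from:
$i_1 = T2[i_2 = 2, \ time\_step = 2] = 2$
3. We have reached the final backtracking step, where we find which initial hidden state the optimal path started from:
$i_0 = T2[i_1 = 2, \ time\_step = 1] = 2$
4. This gives the most likely hidden state sequence:
$I=(q_2, \ q_2, \ q_2)$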
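The whole procedure, forward recursion plus backtracking, can be sketched in a few lines of NumPy. This is a minimal illustration in plain probability space, not the `HMM_NER` implementation shown later; the function name `viterbi` is hypothetical, and `-1` is used in place of $NaN$ for the back-pointers of the first time step so that $T2$ can be an integer array.

```python
import numpy as np

def viterbi(observations, pi, A, B):
    """Return the most likely state path together with the T1 and T2 tables."""
    n_states, seq_len = A.shape[0], len(observations)
    # T1[i, j]: highest probability of any path that ends in state i at step j
    T1 = np.zeros((n_states, seq_len))
    # T2[i, j]: previous state on that best path (-1 stands in for NaN at step 0)
    T2 = np.full((n_states, seq_len), -1, dtype=int)
    T1[:, 0] = pi * B[:, observations[0]]
    for t in range(1, seq_len):
        # scores[i, j] = T1[i, t-1] * A[i, j] * B[j, o_t]
        scores = T1[:, t - 1][:, None] * A * B[:, observations[t]][None, :]
        T1[:, t] = scores.max(axis=0)
        T2[:, t] = scores.argmax(axis=0)
    # backtrack from the best final state through the back-pointers
    path = [int(np.argmax(T1[:, -1]))]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(T2[path[-1], t]))
    return path[::-1], T1, T2

pi = np.array([0.2, 0.4, 0.4])
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])

path, T1, T2 = viterbi([0, 1, 0], pi, A, B)
print(path)
print(T1)
print(T2)
```

The printed path corresponds to $I=(q_2, q_2, q_2)$, and the second column of `T1` reproduces the hand-computed value $T1[0,1]=0.028$.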

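Conclusions:

1. Time complexity: with $N$ hidden states there are $N^2$ possible transitions between two consecutive time steps; with $T$ time steps, the Viterbi algorithm runs in $O(TN^2)$.
2. In practice, to prevent numerical underflow, we usually replace the multiplications with additions of logarithms.
3. See the video lecture for the full example code.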
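The underflow problem is easy to reproduce. In the sketch below, the 500 per-step probabilities of 0.01 are made-up numbers chosen only to make the effect visible: their product is too small for a 64-bit float, while the sum of their logarithms is an ordinary number.

```python
import numpy as np

# 500 hypothetical per-step probabilities, e.g. from a 500-character text
probs = np.full(500, 0.01)

# direct product: the true value 1e-1000 is far below the float64 range
product = np.prod(probs)
# same quantity in log space: a sum of logarithms
log_score = np.sum(np.log(probs))

print(product)    # the product underflows
print(log_score)  # the log score stays finite
```

Because the logarithm is monotonic, taking the max or argmax of log scores selects the same path as the max or argmax of the original products.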

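# The code below is a vectorized version of the forward recursion of the Viterbi algorithm,
# i.e. it computes the values stored in the T1 and T2 tables when going from time t-1 to time t.
# It reproduces the step from time step 0 to time step 1 of the example above;
# see the video lecture for a detailed walkthrough.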
import numpy as np
A = np.array([
    [.5, .2, .3],
    [.3, .5, .2],
    [.2, .3, .5]
])
B = np.array([
    [.5,.5],
    [.4,.6],
    [.7,.3]
])
pi = np.array([
    [.2],
    [.4],
    [.4]
])
print("transitions: A")
print(A)
print("emissions: B")
print(B)
print("pi:")
print(pi)

transitions: A
[[0.5 0.2 0.3]
 [0.3 0.5 0.2]
 [0.2 0.3 0.5]]
emissions: B
[[0.5 0.5]
 [0.4 0.6]
 [0.7 0.3]]
pi:
[[0.2]
 [0.4]
 [0.4]]

T1_prev = np.array([0.1, 0.16, 0.28])
T1_prev = np.expand_dims(T1_prev, axis=-1)
print(T1_prev)
print(T1_prev.shape)

[[0.1 ]
 [0.16]
 [0.28]]
(3, 1)

# The observation at time step 1 is v_1, so take column 1 of B: the probability of each hidden state emitting v_1
p_Obs_State = B[:, 1]
p_Obs_State = np.expand_dims(p_Obs_State, axis=0)
print(p_Obs_State)
print(p_Obs_State.shape)

[[0.5 0.6 0.3]]
 (1, 3)

T1_prev * p_Obs_State * A

array([[0.025 , 0.012 , 0.009 ],
       [0.024 , 0.048 , 0.0096],
       [0.028 , 0.0504, 0.042 ]])

# Take the max over axis 0, i.e. over the previous hidden state (the rows)
np.max(T1_prev * p_Obs_State * A, axis=0)

array([0.028 , 0.0504, 0.042 ])

# Check where each maximum came from, i.e. which hidden state at the previous step it transitioned from
np.argmax(T1_prev * p_Obs_State * A, axis=0)

array([2, 2, 2])

References:

Chinese NER annotated data: https://github.com/SophonPlus/ChineseNlpCorpus
Li Hang, 统计学习方法 (Statistical Learning Methods), 2nd edition, p. 193, Chapter 10: Hidden Markov Models
Wikipedia, Viterbi algorithm: https://en.wikipedia.org/wiki/Viterbi_algorithm
Wikipedia, Hidden Markov model: https://en.wikipedia.org/wiki/Hidden_Markov_model

HMM / Viterbi algorithm

import numpy as np
from utils import *
from tqdm import tqdm


class HMM_NER:
    def __init__(self, char2idx_path, tag2idx_path):
        # Load the dictionaries
        # char2idx: maps a character to its token id
        self.char2idx = load_dict(char2idx_path)
        # tag2idx: maps a tag to its token id
        self.tag2idx = load_dict(tag2idx_path)
        # idx2tag: maps a token id back to its tag
        self.idx2tag = {v: k for k, v in self.tag2idx.items()}
        # Number of hidden states (entity tags) and number of observations (characters)
        self.tag_size = len(self.tag2idx)
        self.vocab_size = max([v for _, v in self.char2idx.items()]) + 1
        # Initialize A, B and pi to all zeros
        self.transition = np.zeros([self.tag_size,
                                    self.tag_size])
        self.emission = np.zeros([self.tag_size,
                                  self.vocab_size])
        self.pi = np.zeros(self.tag_size)
        # Small constant used to avoid log(0) or multiplying by 0
        self.epsilon = 1e-8

    def fit(self, train_dic_path):
        """
        Train the HMM.
        :param train_dic_path: path to the training data
        """
        print("initialize training...")
        train_dic = load_data(train_dic_path)
        # Estimate the transition matrix, the emission matrix and the initial probabilities
        self.estimate_transition_and_initial_probs(train_dic)
        self.estimate_emission_probs(train_dic)
        # take the logarithm
        # Taking logs prevents numerical underflow
        self.pi = np.log(self.pi)
        self.transition = np.log(self.transition)
        self.emission = np.log(self.emission)
        print("DONE!")


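    def estimate_emission_probs(self, train_dic):
        """
        Estimate the emission matrix
        estimate p( Observation | Hidden_state )
        :param train_dic: training samples, each with "text" and "label" entries
        :return: None (self.emission is updated in place)
        """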
        print("estimating emission probabilities...")
        for dic in tqdm(train_dic):
            for char, tag in zip(dic["text"], dic["label"]):
                self.emission[self.tag2idx[tag],
                              self.char2idx[char]] += 1
        self.emission[self.emission == 0] = self.epsilon
        self.emission /= np.sum(self.emission, axis=1, keepdims=True)


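    def estimate_transition_and_initial_probs(self, train_dic):
        """
        Estimate the transition matrix and the initial probabilities (a bigram model over tags)
        estimate p( Y_t+1 | Y_t )
        :param train_dic: training samples, each with a "label" entry
        :return: None (self.transition and self.pi are updated in place)
        """
        print("estimating transition and initial probabilities...")
        for dic in tqdm(train_dic):
            labels = dic["label"]
            if len(labels) == 0:
                continue
            # Count the first tag of every sentence, including one-character sentences
            self.pi[self.tag2idx[labels[0]]] += 1
            for tag, following_tag in zip(labels[:-1], labels[1:]):
                curr_tag = self.tag2idx[tag]
                next_tag = self.tag2idx[following_tag]
                self.transition[curr_tag, next_tag] += 1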
        self.transition[self.transition == 0] = self.epsilon
        self.transition /= np.sum(self.transition, axis=1, keepdims=True)
        self.pi[self.pi == 0] = self.epsilon
        self.pi /= np.sum(self.pi)

    def get_p_Obs_State(self, char):
        # Compute p( observation | state )
        # If the character is unknown (mapped to id 0), use a uniform distribution (in log space)
        char_token = self.char2idx.get(char, 0)
        if char_token == 0:
            return np.log(np.ones(self.tag_size)/self.tag_size)
        return np.ravel(self.emission[:, char_token])

    def predict(self, text):
        # Predict and print the result
        # Decode with the Viterbi algorithm
        if len(text) == 0:
            raise ValueError("input text is empty!")
        best_tag_id = self.viterbi_decode(text)
        self.print_func(text, best_tag_id)

    def print_func(self, text, best_tags_id):
        # Print the prediction
        for char, tag_id in zip(text, best_tags_id):
            print(char+"_"+self.idx2tag[tag_id]+"|", end="")

    def viterbi_decode(self, text):
        """
        Viterbi decoding; see the video lecture or the written tutorial for details
        :param text: a text string
        :return: the most likely hidden state path
        """
        # Sequence length
        seq_len = len(text)
        # Initialize the T1 and T2 tables
        T1_table = np.zeros([seq_len, self.tag_size])
        T2_table = np.zeros([seq_len, self.tag_size])
        # Emission log-probabilities at the first time step
        start_p_Obs_State = self.get_p_Obs_State(text[0])
        # Initial scores for the first step (log pi + log emission), stored in the table
        T1_table[0, :] = self.pi + start_p_Obs_State
        T2_table[0, :] = np.nan

        for i in range(1, seq_len):
            # At each time step, compute the best score and the back-pointer
            # for every hidden state and store them
            # This uses the vectorized computation described in the video lecture
            p_Obs_State = self.get_p_Obs_State(text[i])
            p_Obs_State = np.expand_dims(p_Obs_State, axis=0)
            prev_score = np.expand_dims(T1_table[i-1, :], axis=-1)
            # Broadcasting: previous scores (column) + transition matrix + emission scores (row)
            curr_score = prev_score + self.transition + p_Obs_State
            # Store in T1 and T2
            T1_table[i, :] = np.max(curr_score, axis=0)
            T2_table[i, :] = np.argmax(curr_score, axis=0)
        # Backtrack
        best_tag_id = int(np.argmax(T1_table[-1, :]))
        best_tags = [best_tag_id, ]
        for i in range(seq_len-1, 0, -1):
            best_tag_id = int(T2_table[i, best_tag_id])
            best_tags.append(best_tag_id)
        return list(reversed(best_tags))

if __name__ == '__main__':
    model = HMM_NER(char2idx_path="./dicts/char2idx.json",
                    tag2idx_path="./dicts/tag2idx.json")
    model.fit("./corpus/train_data.txt")
    model.predict("我在中国吃美国的面包")