别再死记硬背BEMS了!用Python手把手带你实现HMM中文分词(附完整代码)
·
从零实现HMM中文分词:Python实战与数学原理深度解析
在自然语言处理领域,中文分词一直是个有趣且具有挑战性的任务。与英文不同,中文没有天然的空格分隔,如何让计算机理解"武汉市长江大桥"应该分成"武汉/市长/江大桥"还是"武汉市/长江/大桥",这背后需要智能算法的支持。今天,我们将抛开枯燥的理论推导,直接动手用Python实现一个基于隐马尔可夫模型(HMM)的中文分词器。
1. 理解HMM分词的核心概念
HMM模型认为,每个汉字在词中的位置(BEMS标签)是隐藏状态,而汉字本身是可观测的输出。我们需要计算三个关键概率:
- 初始概率 :句子第一个字处于B、M、E、S状态的概率
- 转移概率 :从一个状态转移到另一个状态的概率(如B→E)
- 发射概率 :在某个状态下观察到特定汉字的概率
注意:B代表词首,M代表词中,E代表词尾,S代表单字词。例如"苹果"标注为BE,"中国"标注为BE,"啊"标注为S。
让我们用一个微型语料库来直观理解:
corpus = [
("我爱北京天安门", "SBEBEBME"),
("天安门广场", "BEME"),
("你好", "BE")
]
2. 数据准备与概率统计
2.1 构建统计计数器
我们需要三个计数器来统计概率:
from collections import defaultdict
import numpy as np
# 初始化计数器
init_prob = defaultdict(int) # 初始概率
trans_prob = defaultdict(int) # 转移概率
emit_prob = defaultdict(lambda: defaultdict(int)) # 发射概率
state_count = defaultdict(int) # 状态出现总次数
# 统计函数
def update_counts(sentence, states):
# 初始概率
init_prob[states[0]] += 1
# 转移概率
for i in range(len(states)-1):
trans_prob[(states[i], states[i+1])] += 1
# 发射概率和状态计数
for char, state in zip(sentence, states):
emit_prob[state][char] += 1
state_count[state] += 1
2.2 概率计算与平滑处理
直接统计会出现零概率问题,我们需要加入拉普拉斯平滑:
def calculate_probabilities():
# 初始概率
total_init = sum(init_prob.values())
init = {s: (init_prob.get(s, 0) + 1) / (total_init + 4)
for s in ['B', 'M', 'E', 'S']}
# 转移概率
trans = {}
for prev in ['B', 'M', 'E', 'S']:
total = sum(trans_prob.get((prev, next_s), 0) for next_s in ['B', 'M', 'E', 'S'])
trans[prev] = {next_s: (trans_prob.get((prev, next_s), 0) + 1) / (total + 4)
for next_s in ['B', 'M', 'E', 'S']}
# 发射概率
emit = {}
all_chars = set(char for state in emit_prob for char in emit_prob[state])
for state in ['B', 'M', 'E', 'S']:
total = state_count.get(state, 0)
emit[state] = {char: (emit_prob[state].get(char, 0) + 1) / (total + len(all_chars))
for char in all_chars}
return init, trans, emit
3. Viterbi算法实现
Viterbi算法用于找到最可能的状态序列:
def viterbi(sentence, init_prob, trans_prob, emit_prob):
# 初始化
V = [{}]
path = {}
# 初始步
for state in ['B', 'M', 'E', 'S']:
V[0][state] = init_prob[state] * emit_prob[state].get(sentence[0], 1e-10)
path[state] = [state]
# 递推
for t in range(1, len(sentence)):
V.append({})
new_path = {}
for curr_state in ['B', 'M', 'E', 'S']:
max_prob = -1
best_prev_state = None
for prev_state in ['B', 'M', 'E', 'S']:
prob = V[t-1][prev_state] * trans_prob[prev_state][curr_state] * emit_prob[curr_state].get(sentence[t], 1e-10)
if prob > max_prob:
max_prob = prob
best_prev_state = prev_state
V[t][curr_state] = max_prob
new_path[curr_state] = path[best_prev_state] + [curr_state]
path = new_path
# 终止
max_prob = -1
best_path = []
for state in ['B', 'M', 'E', 'S']:
if V[-1][state] > max_prob:
max_prob = V[-1][state]
best_path = path[state]
return best_path
4. 完整分词器实现与优化
将上述组件整合成一个完整的分词器类:
class HMMSegmenter:
def __init__(self):
self.init_prob = None
self.trans_prob = None
self.emit_prob = None
def train(self, corpus):
# 初始化计数器
init_prob = defaultdict(int)
trans_prob = defaultdict(int)
emit_prob = defaultdict(lambda: defaultdict(int))
state_count = defaultdict(int)
# 统计计数
for sentence, states in corpus:
# 初始概率
init_prob[states[0]] += 1
# 转移概率
for i in range(len(states)-1):
trans_prob[(states[i], states[i+1])] += 1
# 发射概率
for char, state in zip(sentence, states):
emit_prob[state][char] += 1
state_count[state] += 1
# 计算概率(带平滑)
# 初始概率
total_init = sum(init_prob.values())
self.init_prob = {s: (init_prob.get(s, 0) + 1) / (total_init + 4)
for s in ['B', 'M', 'E', 'S']}
# 转移概率
self.trans_prob = {}
for prev in ['B', 'M', 'E', 'S']:
total = sum(trans_prob.get((prev, next_s), 0)
for next_s in ['B', 'M', 'E', 'S'])
self.trans_prob[prev] = {
next_s: (trans_prob.get((prev, next_s), 0) + 1) / (total + 4)
for next_s in ['B', 'M', 'E', 'S']
}
# 发射概率
self.emit_prob = {}
all_chars = set(char for state in emit_prob for char in emit_prob[state])
for state in ['B', 'M', 'E', 'S']:
total = state_count.get(state, 0)
self.emit_prob[state] = {
char: (emit_prob[state].get(char, 0) + 1) / (total + len(all_chars))
for char in all_chars
}
def segment(self, sentence):
if not self.init_prob:
raise ValueError("Model not trained yet!")
# Viterbi算法获取最佳状态序列
best_path = self._viterbi(sentence)
# 根据状态序列进行分词
words = []
start = 0
for i in range(len(best_path)):
if best_path[i] in ['S', 'E']:
words.append(sentence[start:i+1])
start = i+1
return words
def _viterbi(self, sentence):
# 初始化
V = [{}]
path = {}
# 初始步
for state in ['B', 'M', 'E', 'S']:
V[0][state] = self.init_prob[state] * self.emit_prob[state].get(sentence[0], 1e-10)
path[state] = [state]
# 递推
for t in range(1, len(sentence)):
V.append({})
new_path = {}
for curr_state in ['B', 'M', 'E', 'S']:
max_prob = -1
best_prev_state = None
for prev_state in ['B', 'M', 'E', 'S']:
prob = V[t-1][prev_state] * \
self.trans_prob[prev_state][curr_state] * \
self.emit_prob[curr_state].get(sentence[t], 1e-10)
if prob > max_prob:
max_prob = prob
best_prev_state = prev_state
V[t][curr_state] = max_prob
new_path[curr_state] = path[best_prev_state] + [curr_state]
path = new_path
# 终止
max_prob = -1
best_path = []
for state in ['B', 'M', 'E', 'S']:
if V[-1][state] > max_prob:
max_prob = V[-1][state]
best_path = path[state]
return best_path
5. 实战测试与性能提升
5.1 基础测试
让我们用一个小型语料库测试我们的分词器:
# 训练数据
train_data = [
("我爱北京天安门", "SBEBEBME"),
("天安门广场", "BEME"),
("你好", "BE"),
("清华大学", "BMME"),
("研究生命", "BMME"),
("他说的确实在理", "SBEBESBME")
]
# 初始化并训练模型
segmenter = HMMSegmenter()
segmenter.train(train_data)
# 测试分词
test_sentences = ["我爱清华大学", "研究生命科学", "天安门广场很壮观"]
for sent in test_sentences:
print(f"'{sent}' 分词结果: {segmenter.segment(sent)}")
5.2 性能优化技巧
在实际应用中,我们还需要考虑以下优化点:
- 对数概率计算 :
- 避免连乘导致的下溢问题
- 将概率相乘转换为对数概率相加
# 修改Viterbi算法中的概率计算
prob = V[t-1][prev_state] + \
np.log(self.trans_prob[prev_state][curr_state]) + \
np.log(self.emit_prob[curr_state].get(sentence[t], 1e-10))
-
未知词处理 :
- 对于训练集中未出现的汉字,采用回退策略
- 可以使用汉字的结构特征(偏旁部首)或字符类别
-
模型持久化 :
- 将训练好的模型参数保存到文件
- 避免每次使用时重新训练
import pickle
# 保存模型
def save_model(model, filename):
with open(filename, 'wb') as f:
pickle.dump({
'init_prob': model.init_prob,
'trans_prob': model.trans_prob,
'emit_prob': model.emit_prob
}, f)
# 加载模型
def load_model(filename):
with open(filename, 'rb') as f:
params = pickle.load(f)
model = HMMSegmenter()
model.init_prob = params['init_prob']
model.trans_prob = params['trans_prob']
model.emit_prob = params['emit_prob']
return model
6. 扩展与应用场景
HMM分词器虽然原理简单,但在许多场景下仍然表现良好:
- 垂直领域分词 :针对特定领域(医疗、法律等)训练专用模型
- 预处理工具 :作为更复杂NLP系统的预处理组件
- 教育工具 :帮助学习者理解分词和序列标注的基本概念
对于追求更高准确率的场景,可以考虑以下扩展方向:
- 与词典结合 :混合基于词典和统计的方法
- 深度学习模型 :使用BiLSTM-CRF等神经网络模型
- 多模型融合 :结合多种分词算法的结果
在实际项目中,我发现HMM模型特别适合处理以下情况:
- 训练数据量有限时
- 需要快速原型开发时
- 系统资源受限的环境中
一个实用的建议是:对于生产环境,可以先从HMM这样的基础模型开始,建立基线性能,然后再逐步引入更复杂的模型。这样既能快速验证想法,又能为后续优化提供明确的方向。
更多推荐



所有评论(0)