Word Segmentation with Jieba, NLTK, and Other Chinese and English Tokenizers


木禾DING · Published 2019-03-20 23:03:40

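Objective:

Given Chinese and English text (see Chinese.txt and English.txt), segment each with the tools listed below and briefly compare the results the different tools produce.

Tools:

Chinese: Jieba (the main focus; try its three segmentation modes and its custom-dictionary feature), SnowNLP, THULAC, NLPIR, StanfordCoreNLP

English: NLTK, SpaCy, StanfordCoreNLP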

Environment:

Language: Python 3.7.0

IDE: PyCharm

Many packages have to be installed with pip; please look up an installation guide for each one.

Steps:

First, Chinese word segmentation:

1. jieba

import jieba
import re

chinese = '央视315晚会曝光湖北省知名的神丹牌、莲田牌“土鸡蛋”实为普通鸡蛋冒充，同时在商标上玩猫腻，分别注册“鲜土”、注册“好土”商标，让消费者误以为是“土鸡蛋”。3月15日晚间，新京报记者就此事致电湖北神丹健康食品有限公司方面，其工作人员表示不知情，需要了解清楚情况，截至发稿暂未取得最新回应。新京报记者还查询发现，湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业，此前曾因涉嫌虚假宣传“中国最大的蛋品企业”而被罚6万元。'

# Strip punctuation with a regex; every later Chinese example reuses this str.
# (The name shadows Python's built-in str; it is kept so the later snippets match.)
str = re.sub(r'[^\w]', '', chinese)

seg_list = jieba.cut(str, cut_all=False)  # accurate mode
print('/'.join(seg_list))

Segmentation result:

央视/315/晚会/曝光/湖北省/知名/的/神丹/牌莲田牌/土/鸡蛋/实为/普通/鸡蛋/冒充/同时/在/商标/上/玩/猫腻/分别/注册/鲜土/注册/好土/商标/让/消费者/误以为/是/土/鸡蛋/3/月/15/日/晚间/新/京报/记者/就/此事/致电/湖北/神丹/健康/食品/有限公司/方面/其/工作人员/表示/不知情/需要/了解/清楚/情况/截至/发稿/暂未/取得/最新/回应/新/京报/记者/还/查询/发现/湖北/神丹/健康/食品/有限公司/为/农业/产业化/国家/重点/龙头企业/高新技术/企业/此前/曾/因涉嫌/虚假/宣传/中国/最大/的/蛋品/企业/而/被/罚/6/万元

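Loading a custom dictionary

Use jieba.load_userdict(file):

dict_path = 'userdict.txt'  # hypothetical path; the file lists one word per line: 神丹牌, 莲田牌, 土鸡蛋, 新京报
jieba.load_userdict(dict_path)  # accepts a file path or a file-like object; the file must be UTF-8
seg_list = jieba.cut(str, cut_all=False)  # accurate mode; str is the cleaned string from above
print('/'.join(seg_list))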

Result:

央视/315/晚会/曝光/湖北省/知名/的/神丹牌/莲田牌/土鸡蛋/实为/普通/鸡蛋/冒充/同时/在/商标/上/玩/猫腻/分别/注册/鲜土/注册/好土/商标/让/消费者/误以为/是/土鸡蛋/3/月/15/日/晚间/新京报/记者/就/此事/致电/湖北/神丹/健康/食品/有限公司/方面/其/工作人员/表示/不知情/需要/了解/清楚/情况/截至/发稿/暂未/取得/最新/回应/新京报/记者/还/查询/发现/湖北/神丹/健康/食品/有限公司/为/农业/产业化/国家/重点/龙头企业/高新技术/企业/此前/曾/因涉嫌/虚假/宣传/中国/最大/的/蛋品/企业/而/被/罚/6/万元

Clearly, once the dictionary is loaded, 神丹牌, 莲田牌, 土鸡蛋, and 新京报 are each kept together as a single token.

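2. SnowNLP

from snownlp import SnowNLP
s = SnowNLP(str)      # str is the punctuation-free Chinese string from above
print(s.words)        # word segmentation
print(s.pinyin)       # pinyin transcription
print(s.summary(3))   # extractive summary (up to 3 sentences)
print(s.keywords(3))  # top 3 keywords
print(s.han)          # convert traditional characters to simplified

Segmentation result (a list):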

['央视', '315', '晚会', '曝光', '湖北省', '知名', '的', '神丹', '牌', '莲', '田', '牌', '土', '鸡蛋', '实', '为', '普通', '鸡蛋', '冒充', '同时', '在', '商标', '上', '玩猫', '腻', '分别', '注册', '鲜', '土', '注册', '好', '土', '商标', '让', '消费者', '误', '以为', '是', '土', '鸡蛋', '3', '月', '15', '日', '晚间', '新京', '报', '记者', '就', '此事', '致电', '湖北', '神', '丹', '健康', '食品', '有限公司', '方面', '其', '工作', '人员', '表示', '不', '知情', '需要', '了解', '清楚', '情况', '截至', '发稿', '暂', '未', '取得', '最新', '回应', '新京', '报', '记者', '还', '查询', '发现', '湖北', '神', '丹', '健康', '食品', '有限公司', '为', '农业', '产业化', '国家', '重点', '龙头', '企业', '高新技术', '企业', '此前', '曾', '因', '涉嫌', '虚假', '宣传', '中国', '最', '大', '的', '蛋品', '企业', '而', '被', '罚', '6', '万', '元']

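3. THULAC

import thulac
t = thulac.thulac()  # default model: segmentation plus POS tagging
# text=True returns a single string; text=False (the default) returns a list of [word, tag] pairs
result = t.cut(str, text=False)
print(result)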

[['央视', 'v'], ['315', 'm'], ['晚会', 'n'], ['曝光', 'v'], ['湖北省', 'ns'], ['知名', 'a'], ['的', 'u'], ['神丹牌', 'nz'], ['莲田牌', 'nz'], ['土鸡蛋', 'n'], ['实', 'a'], ['为', 'v'], ['普通', 'a'], ['鸡蛋', 'n'], ['冒充', 'v'], ['同时', 'd'], ['在', 'p'], ['商标', 'n'], ['上', 'f'], ['玩', 'v'], ['猫腻', 'n'], ['分别', 'd'], ['注册', 'v'], ['鲜土', 'n'], ['注册', 'v'], ['好', 'a'], ['土', 'n'], ['商标', 'n'], ['让', 'v'], ['消费者', 'n'], ['误', 'd'], ['以为', 'v'], ['是', 'v'], ['土鸡蛋', 'n'], ['3月', 't'], ['15日', 't'], ['晚间', 't'], ['新', 'a'], ['京报', 'n'], ['记者', 'n'], ['就', 'p'], ['此事', 'r'], ['致电', 'v'], ['湖北', 'ns'], ['神丹', 'nz'], ['健康', 'a'], ['食品', 'n'], ['有限公司', 'n'], ['方面', 'n'], ['其', 'r'], ['工作', 'v'], ['人员', 'n'], ['表示', 'v'], ['不', 'd'], ['知情', 'v'], ['需要', 'v'], ['了', 'u'], ['解', 'v'], ['清楚', 'a'], ['情况', 'n'], ['截至', 'v'], ['发稿', 'v'], ['暂', 'd'], ['未', 'd'], ['取得', 'v'], ['最新', 'a'], ['回应', 'v'], ['新', 'a'], ['京报', 'n'], ['记者', 'n'], ['还', 'd'], ['查询', 'v'], ['发现', 'v'], ['湖北', 'ns'], ['神丹', 'nz'], ['健康', 'a'], ['食品', 'n'], ['有限公司', 'n'], ['为', 'p'], ['农业', 'n'], ['产业化', 'v'], ['国', 'm'], ['家', 'q'], ['重点', 'n'], ['龙头', 'n'], ['企业', 'n'], ['高新技术', 'n'], ['企业', 'n'], ['此前', 't'], ['曾', 'd'], ['因', 'p'], ['涉嫌', 'v'], ['虚假', 'a'], ['宣传', 'v'], ['中国', 'ns'], ['最', 'd'], ['大', 'a'], ['的', 'u'], ['蛋品', 'n'], ['企业', 'n'], ['而', 'c'], ['被', 'p'], ['罚', 'v'], ['6万', 'm'], ['元', 'q']]

To segment without POS tagging, create the model with seg_only=True:

t2 = thulac.thulac(seg_only=True)  # segmentation only, no POS tags

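4. PyNLPIR

import pynlpir
pynlpir.open()  # initialize the NLPIR engine
print(pynlpir.segment(str))  # segmentation; returns (word, POS) tuples by default
pynlpir.close()  # release the engine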

[('央', 'verb'), ('视', 'verb'), ('315', 'numeral'), ('晚会', 'noun'), ('曝光', 'verb'), ('湖北省', 'noun'), ('知名', 'adjective'), ('的', 'particle'), ('神', 'noun'), ('丹', 'distinguishing word'), ('牌', 'noun'), ('、', 'punctuation mark'), ('莲', 'noun'), ('田', 'noun'), ('牌', 'noun'), ('“', 'punctuation mark'), ('土', 'noun'), ('鸡蛋', 'noun'), ('”', 'punctuation mark'), ('实', 'adjective'), ('为', 'verb'), ('普通', 'adjective'), ('鸡蛋', 'noun'), ('冒充', 'verb'), ('，', 'punctuation mark'), ('同时', 'conjunction'), ('在', 'preposition'), ('商标', 'noun'), ('上', 'noun of locality'), ('玩', 'verb'), ('猫腻', 'noun'), ('，', 'punctuation mark'), ('分别', 'adverb'), ('注册', 'verb'), ('“', 'punctuation mark'), ('鲜', 'adjective'), ('土', 'noun'), ('”', 'punctuation mark'), ('、', 'punctuation mark'), ('注册', 'verb'), ('“', 'punctuation mark'), ('好', 'adjective'), ('土', 'noun'), ('”', 'punctuation mark'), ('商标', 'noun'), ('，', 'punctuation mark'), ('让', 'verb'), ('消费者', 'noun'), ('误', 'adverb'), ('以为', 'verb'), ('是', 'verb'), ('“', 'punctuation mark'), ('土', 'noun'), ('鸡蛋', 'noun'), ('”', 'punctuation mark'), ('。', 'punctuation mark'), ('3月', 'time word'), ('15日', 'time word'), ('晚间', 'time word'), ('，', 'punctuation mark'), ('新京报', None), ('记者', 'noun'), ('就', 'adverb'), ('此事', 'pronoun'), ('致电', 'verb'), ('湖北', 'noun'), ('神', 'noun'), ('丹', 'distinguishing word'), ('健康', 'adjective'), ('食品', 'noun'), ('有限公司', 'noun'), ('方面', 'noun'), ('，', 'punctuation mark'), ('其', 'pronoun'), ('工作', 'verb'), ('人员', 'noun'), ('表示', 'verb'), ('不', 'adverb'), ('知', 'verb'), ('情', 'noun'), ('，', 'punctuation mark'), ('需要', 'verb'), ('了解', 'verb'), ('清楚', 'adjective'), ('情况', 'noun'), ('，', 'punctuation mark'), ('截至', 'verb'), ('发稿', 'verb'), ('暂', 'adverb'), ('未', 'adverb'), ('取得', 'verb'), ('最新', 'adjective'), ('回应', 'verb'), ('。', 'punctuation mark'), ('新京报', None), ('记者', 'noun'), ('还', 'adverb'), ('查询', 'verb'), ('发现', 'verb'), ('，', 'punctuation mark'), ('湖北', 'noun'), ('神', 'noun'), ('丹', 'distinguishing word'), ('健康', 
'adjective'), ('食品', 'noun'), ('有限公司', 'noun'), ('为', 'preposition'), ('农业', 'noun'), ('产业化', 'verb'), ('国家', 'noun'), ('重点', 'noun'), ('龙头', 'noun'), ('企业', 'noun'), ('、', 'punctuation mark'), ('高新技术', 'noun'), ('企业', 'noun'), ('，', 'punctuation mark'), ('此前', 'time word'), ('曾', 'adverb'), ('因', 'preposition'), ('涉嫌', 'verb'), ('虚假', 'adjective'), ('宣传', 'verb'), ('“', 'punctuation mark'), ('中国', 'noun'), ('最', 'adverb'), ('大', 'adjective'), ('的', 'particle'), ('蛋品', 'noun'), ('企业', 'noun'), ('”', 'punctuation mark'), ('而', 'conjunction'), ('被', 'preposition'), ('罚', 'verb'), ('6万', 'numeral'), ('元', 'classifier'), ('。', 'punctuation mark')]
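
The PyNLPIR output above keeps punctuation tokens, while the other tools' outputs do not. To put it on the same footing, one option is to drop every token tagged 'punctuation mark' and keep only the words. This is a small sketch: the helper name strip_punctuation is hypothetical, and the pairs are a short excerpt copied from the output above.

```python
def strip_punctuation(tagged):
    """Keep the words from (word, tag) pairs, skipping punctuation tokens."""
    return [word for word, tag in tagged if tag != 'punctuation mark']


# Excerpt of the PyNLPIR output shown above.
pairs = [('神', 'noun'), ('丹', 'distinguishing word'), ('牌', 'noun'),
         ('、', 'punctuation mark'), ('莲', 'noun'), ('田', 'noun'),
         ('牌', 'noun'), ('“', 'punctuation mark'), ('土', 'noun'),
         ('鸡蛋', 'noun'), ('”', 'punctuation mark')]

words = strip_punctuation(pairs)
print('/'.join(words))
```

Tokens whose tag is None (such as 新京报 in the output above) are kept, since only the punctuation tag is filtered out.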

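5. StanfordCoreNLP:

from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'G:\stanford-corenlp-full-2018-10-05\stanford-corenlp-full-2018-10-05', lang='zh')
print(nlp.word_tokenize(str))  # returns a list
# print(nlp.pos_tag(str))  # POS tagging
# print(nlp.parse(str))    # constituency parse
nlp.close()  # shut down the background Java server

Result: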

['央视', '315', '晚会', '曝光', '湖北省', '知名', '的', '神丹', '牌', '莲', '田', '牌', '土', '鸡蛋', '实为', '普通', '鸡蛋', '冒充', '同时', '在', '商标', '上', '玩', '猫腻', '分别', '注册', '鲜土', '注册', '好', '土', '商标', '让', '消费者', '误以为', '是', '土', '鸡蛋', '3月', '15日', '晚间', '新京报', '记者', '就此事', '致电', '湖北', '神丹', '健康', '食品', '有限', '公司', '方面', '其', '工作', '人员', '表示', '不知情', '需要', '了解', '清楚', '情况', '截至', '发稿', '暂', '未', '取得', '最新', '回应', '新京报', '记者', '还', '查询', '发现', '湖北', '神丹', '健康', '食品', '有限', '公司', '为', '农业', '产业化', '国家', '重点', '龙头', '企业', '高', '新', '技术', '企业', '此前', '曾', '因', '涉嫌', '虚假', '宣传', '中国', '最', '大', '的', '蛋品', '企业', '而', '被', '罚', '6万', '元']
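
For the comparison the objective asks for, one rough measure is how many word boundaries two tools agree on. The sketch below computes the Jaccard overlap of the boundary positions of two segmentations of the same text. The function names are hypothetical, the metric is an illustrative choice rather than a standard evaluation, and the two token lists are excerpts from the jieba (accurate mode, no custom dictionary) and StanfordCoreNLP outputs above.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def agreement(a, b):
    """Jaccard overlap of the boundary sets of two segmentations of one text."""
    if ''.join(a) != ''.join(b):
        raise ValueError('the two segmentations cover different text')
    ends_a, ends_b = boundaries(a), boundaries(b)
    return len(ends_a & ends_b) / len(ends_a | ends_b)


# Excerpts covering the same ten characters: 新京报记者就此事致电
jieba_tokens = ['新', '京报', '记者', '就', '此事', '致电']
stanford_tokens = ['新京报', '记者', '就此事', '致电']

score = agreement(jieba_tokens, stanford_tokens)
print(round(score, 3))  # 0.667
```

A score of 1.0 means the two tools cut the text at exactly the same places; here the two disagree on 新京报 and 就此事, so four of the six distinct boundaries are shared.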

Next, English tokenization:

# The English sample; the NLTK code below reads its text from English.txt.
english_text = "Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion."

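6. NLTK:

import nltk
import re
# word_tokenize and pos_tag need NLTK's tokenizer and tagger data, installed once via nltk.download(...)
english = 'H:\\自然语言处理\\Experiment2\\English.txt'
with open(english, 'r', encoding='utf-8') as file:
    u = file.read()
str = re.sub(r'[^\w ]', '', u)  # strip punctuation but keep spaces
tokens = nltk.word_tokenize(str)
print(tokens)
print(nltk.pos_tag(tokens))  # POS-tag the tokenized result

Result: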

['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']
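
The result above shows a side effect of deleting every non-word character before tokenizing: family's becomes familys, co-authored becomes coauthored, and $3.1 becomes 31. The sketch below reproduces that on a short excerpt and contrasts it with a regex tokenizer that keeps apostrophes, hyphens, and decimal points inside a token. The pattern is an illustrative assumption, not what NLTK does internally.

```python
import re

excerpt = ("his family's real estate. He co-authored several books. "
           "net worth to be $3.1 billion.")

# The approach used above: delete everything except word characters and spaces.
stripped = re.sub(r'[^\w ]', '', excerpt).split()

# Alternative: match runs of word characters, allowing - ' . between them.
kept = re.findall(r"\w+(?:[-'.]\w+)*", excerpt)

print(stripped)
print(kept)
```

Another option is to pass the raw text to the tokenizer and filter out punctuation tokens afterwards, which leaves decisions about apostrophes and hyphens to the tokenizer itself.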

7. spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')
document = nlp(str)
print([token.text for token in document])  # spaCy's own tokens; document.text.split() would only split on whitespace

Result:

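8. StanfordCoreNLP:

nlp = StanfordCoreNLP(r'G:\stanford-corenlp-full-2018-10-05\stanford-corenlp-full-2018-10-05', lang='en')
print(nlp.word_tokenize(str))
nlp.close()  # shut down the background Java server

Result:

That covers the segmentation process for all eight tools. My recommendation: use jieba for Chinese word segmentation and NLTK for English tokenization.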
