Jieba、NLTK等中英文分词工具进行分词

实验目的：利用给定的中英文文本序列（见 Chinese.txt 和 English.txt），分别利用以下给定的中英文分词工具进行分词并对不同分词工具产生的结果进行简要对比分析。实验工具：中文 Jieba（重点），尝试三种分词模式与自定义词典功能、SnowNLP、THULAC、NLPIR、StanfordCoreNLP、英文 NLTK、SpaCy、StanfordCore...

木禾DING

31581人浏览 · 2019-03-20 23:03:40

木禾DING · 2019-03-20 23:03:40 发布

实验目的：

利用给定的中英文文本序列（见 Chinese.txt 和 English.txt），分别利用以下给定的中

英文分词工具进行分词并对不同分词工具产生的结果进行简要对比分析。

实验工具：

中文 Jieba（重点），尝试三种分词模式与自定义词典功能、SnowNLP、THULAC、NLPIR、StanfordCoreNLP、

英文 NLTK、SpaCy、StanfordCoreNLP

实验环境：

语言：Python 3.7.0

IDE： Pycharm

需要使用 pip 安装很多包，这里请大家去搜索相关教程安装

实验步骤：

首先进行中文分词：

一、jieba

import jieba
import re
Chinese=‘央视315晚会曝光湖北省知名的神丹牌、莲田牌“土鸡蛋”实为普通鸡蛋冒充，同时在商标上玩猫腻，分别注册“鲜土”、注册“好土”商标，让消费者误以为是“土鸡蛋”。3月15日晚间，新京报记者就此事致电湖北神丹健康食品有限公司方面，其工作人员表示不知情，需要了解清楚情况，截至发稿暂未取得最新回应。新京报记者还查询发现，湖北神丹健康食品有限公司为农业产业化国家重点龙头企业、高新技术企业，此前曾因涉嫌虚假宣传“中国最大的蛋品企业”而被罚6万元。’

str=re.sub('[^\w]','',chinese)   #使用正则去符号，之后都是用这个str字符串

seg_list=jieba.cut(s_list, cut_all=False) #精确模式
print('/'.join(seg_list))

分词的结果：

央视/315/晚会/曝光/湖北省/知名/的/神丹/牌莲田牌/土/鸡蛋/实为/普通/鸡蛋/冒充/同时/在/商标/上/玩/猫腻/分别/注册/鲜土/注册/好土/商标/让/消费者/误以为/是/土/鸡蛋/3/月/15/日/晚间/新/京报/记者/就/此事/致电/湖北/神丹/健康/食品/有限公司/方面/其/工作人员/表示/不知情/需要/了解/清楚/情况/截至/发稿/暂未/取得/最新/回应/新/京报/记者/还/查询/发现/湖北/神丹/健康/食品/有限公司/为/农业/产业化/国家/重点/龙头企业/高新技术/企业/此前/曾/因涉嫌/虚假/宣传/中国/最大/的/蛋品/企业/而/被/罚/6/万元

载入自创建的词典

使用 jieba.load_userdict(file)

file=open(dict,'r')   # 载入一个词典，这个词典的内容为：神丹牌、莲花牌、土鸡蛋、新京报
jieba.load_userdict(file)
file.close()
seg_list=jieba.cut(str, cut_all=False) #精确模式  str 为之前的字符串
print('/'.join(seg_list))

结果：

央视/315/晚会/曝光/湖北省/知名/的/神丹牌/莲田牌/土鸡蛋/实为/普通/鸡蛋/冒充/同时/在/商标/上/玩/猫腻/分别/注册/鲜土/注册/好土/商标/让/消费者/误以为/是/土鸡蛋/3/月/15/日/晚间/新京报/记者/就/此事/致电/湖北/神丹/健康/食品/有限公司/方面/其/工作人员/表示/不知情/需要/了解/清楚/情况/截至/发稿/暂未/取得/最新/回应/新京报/记者/还/查询/发现/湖北/神丹/健康/食品/有限公司/为/农业/产业化/国家/重点/龙头企业/高新技术/企业/此前/曾/因涉嫌/虚假/宣传/中国/最大/的/蛋品/企业/而/被/罚/6/万元

显然载入词典之后，神丹牌、莲花牌、土鸡蛋、新京报合在了一起

二、SnowNlp

from snownlp import SnowNLP
s=SnowNLP(str)   #str为之前去掉符号的中文字符串
print(s.words)   #进行分词
print(s.pinyin)  #得到拼音
print(s.summary(3)) #进行总结 summary
print(s.keywords(3)) # 得到关键词
print(s.han)  #把繁体字变成简体字

分词的结果：是一个列表

['央视', '315', '晚会', '曝光', '湖北省', '知名', '的', '神丹', '牌', '莲', '田', '牌', '土', '鸡蛋', '实', '为', '普通', '鸡蛋', '冒充', '同时', '在', '商标', '上', '玩猫', '腻', '分别', '注册', '鲜', '土', '注册', '好', '土', '商标', '让', '消费者', '误', '以为', '是', '土', '鸡蛋', '3', '月', '15', '日', '晚间', '新京', '报', '记者', '就', '此事', '致电', '湖北', '神', '丹', '健康', '食品', '有限公司', '方面', '其', '工作', '人员', '表示', '不', '知情', '需要', '了解', '清楚', '情况', '截至', '发稿', '暂', '未', '取得', '最新', '回应', '新京', '报', '记者', '还', '查询', '发现', '湖北', '神', '丹', '健康', '食品', '有限公司', '为', '农业', '产业化', '国家', '重点', '龙头', '企业', '高新技术', '企业', '此前', '曾', '因', '涉嫌', '虚假', '宣传', '中国', '最', '大', '的', '蛋品', '企业', '而', '被', '罚', '6', '万', '元']

三、Thulac

t=thulac.thulac()  #进行分词和标注词性
text=t.cut(str,text=False) #进行分词和标注词性,若text=True 则为 str，否则为默认模式返回值为list
print(text)

[['央视', 'v'], ['315', 'm'], ['晚会', 'n'], ['曝光', 'v'], ['湖北省', 'ns'], ['知名', 'a'], ['的', 'u'], ['神丹牌', 'nz'], ['莲田牌', 'nz'], ['土鸡蛋', 'n'], ['实', 'a'], ['为', 'v'], ['普通', 'a'], ['鸡蛋', 'n'], ['冒充', 'v'], ['同时', 'd'], ['在', 'p'], ['商标', 'n'], ['上', 'f'], ['玩', 'v'], ['猫腻', 'n'], ['分别', 'd'], ['注册', 'v'], ['鲜土', 'n'], ['注册', 'v'], ['好', 'a'], ['土', 'n'], ['商标', 'n'], ['让', 'v'], ['消费者', 'n'], ['误', 'd'], ['以为', 'v'], ['是', 'v'], ['土鸡蛋', 'n'], ['3月', 't'], ['15日', 't'], ['晚间', 't'], ['新', 'a'], ['京报', 'n'], ['记者', 'n'], ['就', 'p'], ['此事', 'r'], ['致电', 'v'], ['湖北', 'ns'], ['神丹', 'nz'], ['健康', 'a'], ['食品', 'n'], ['有限公司', 'n'], ['方面', 'n'], ['其', 'r'], ['工作', 'v'], ['人员', 'n'], ['表示', 'v'], ['不', 'd'], ['知情', 'v'], ['需要', 'v'], ['了', 'u'], ['解', 'v'], ['清楚', 'a'], ['情况', 'n'], ['截至', 'v'], ['发稿', 'v'], ['暂', 'd'], ['未', 'd'], ['取得', 'v'], ['最新', 'a'], ['回应', 'v'], ['新', 'a'], ['京报', 'n'], ['记者', 'n'], ['还', 'd'], ['查询', 'v'], ['发现', 'v'], ['湖北', 'ns'], ['神丹', 'nz'], ['健康', 'a'], ['食品', 'n'], ['有限公司', 'n'], ['为', 'p'], ['农业', 'n'], ['产业化', 'v'], ['国', 'm'], ['家', 'q'], ['重点', 'n'], ['龙头', 'n'], ['企业', 'n'], ['高新技术', 'n'], ['企业', 'n'], ['此前', 't'], ['曾', 'd'], ['因', 'p'], ['涉嫌', 'v'], ['虚假', 'a'], ['宣传', 'v'], ['中国', 'ns'], ['最', 'd'], ['大', 'a'], ['的', 'u'], ['蛋品', 'n'], ['企业', 'n'], ['而', 'c'], ['被', 'p'], ['罚', 'v'], ['6万', 'm'], ['元', 'q']]

若

t2=thulac.thulac(seg_only=True)  #只进行分词 segment

则只进行分词，不标注词性

四、Pynlpir

pynlpir.open()
print(pynlpir.segment(str)) #分词

[('央', 'verb'), ('视', 'verb'), ('315', 'numeral'), ('晚会', 'noun'), ('曝光', 'verb'), ('湖北省', 'noun'), ('知名', 'adjective'), ('的', 'particle'), ('神', 'noun'), ('丹', 'distinguishing word'), ('牌', 'noun'), ('、', 'punctuation mark'), ('莲', 'noun'), ('田', 'noun'), ('牌', 'noun'), ('“', 'punctuation mark'), ('土', 'noun'), ('鸡蛋', 'noun'), ('”', 'punctuation mark'), ('实', 'adjective'), ('为', 'verb'), ('普通', 'adjective'), ('鸡蛋', 'noun'), ('冒充', 'verb'), ('，', 'punctuation mark'), ('同时', 'conjunction'), ('在', 'preposition'), ('商标', 'noun'), ('上', 'noun of locality'), ('玩', 'verb'), ('猫腻', 'noun'), ('，', 'punctuation mark'), ('分别', 'adverb'), ('注册', 'verb'), ('“', 'punctuation mark'), ('鲜', 'adjective'), ('土', 'noun'), ('”', 'punctuation mark'), ('、', 'punctuation mark'), ('注册', 'verb'), ('“', 'punctuation mark'), ('好', 'adjective'), ('土', 'noun'), ('”', 'punctuation mark'), ('商标', 'noun'), ('，', 'punctuation mark'), ('让', 'verb'), ('消费者', 'noun'), ('误', 'adverb'), ('以为', 'verb'), ('是', 'verb'), ('“', 'punctuation mark'), ('土', 'noun'), ('鸡蛋', 'noun'), ('”', 'punctuation mark'), ('。', 'punctuation mark'), ('3月', 'time word'), ('15日', 'time word'), ('晚间', 'time word'), ('，', 'punctuation mark'), ('新京报', None), ('记者', 'noun'), ('就', 'adverb'), ('此事', 'pronoun'), ('致电', 'verb'), ('湖北', 'noun'), ('神', 'noun'), ('丹', 'distinguishing word'), ('健康', 'adjective'), ('食品', 'noun'), ('有限公司', 'noun'), ('方面', 'noun'), ('，', 'punctuation mark'), ('其', 'pronoun'), ('工作', 'verb'), ('人员', 'noun'), ('表示', 'verb'), ('不', 'adverb'), ('知', 'verb'), ('情', 'noun'), ('，', 'punctuation mark'), ('需要', 'verb'), ('了解', 'verb'), ('清楚', 'adjective'), ('情况', 'noun'), ('，', 'punctuation mark'), ('截至', 'verb'), ('发稿', 'verb'), ('暂', 'adverb'), ('未', 'adverb'), ('取得', 'verb'), ('最新', 'adjective'), ('回应', 'verb'), ('。', 'punctuation mark'), ('新京报', None), ('记者', 'noun'), ('还', 'adverb'), ('查询', 'verb'), ('发现', 'verb'), ('，', 'punctuation mark'), ('湖北', 'noun'), ('神', 'noun'), ('丹', 'distinguishing word'), ('健康', 'adjective'), ('食品', 'noun'), ('有限公司', 'noun'), ('为', 'preposition'), ('农业', 'noun'), ('产业化', 'verb'), ('国家', 'noun'), ('重点', 'noun'), ('龙头', 'noun'), ('企业', 'noun'), ('、', 'punctuation mark'), ('高新技术', 'noun'), ('企业', 'noun'), ('，', 'punctuation mark'), ('此前', 'time word'), ('曾', 'adverb'), ('因', 'preposition'), ('涉嫌', 'verb'), ('虚假', 'adjective'), ('宣传', 'verb'), ('“', 'punctuation mark'), ('中国', 'noun'), ('最', 'adverb'), ('大', 'adjective'), ('的', 'particle'), ('蛋品', 'noun'), ('企业', 'noun'), ('”', 'punctuation mark'), ('而', 'conjunction'), ('被', 'preposition'), ('罚', 'verb'), ('6万', 'numeral'), ('元', 'classifier'), ('。', 'punctuation mark')]

五、StanfordCoreNLP：

nlp=StanfordCoreNLP(r'G:\\stanford-corenlp-full-2018-10-05\\stanford-corenlp-full-2018-10-05',lang='zh')
print(nlp.word_tokenize(s_list)) #返回一个列表
# print(nlp.pos_tag(str))  #词性标注
# print(nlp.parse(str))  #解析

结果：

['央视', '315', '晚会', '曝光', '湖北省', '知名', '的', '神丹', '牌', '莲', '田', '牌', '土', '鸡蛋', '实为', '普通', '鸡蛋', '冒充', '同时', '在', '商标', '上', '玩', '猫腻', '分别', '注册', '鲜土', '注册', '好', '土', '商标', '让', '消费者', '误以为', '是', '土', '鸡蛋', '3月', '15日', '晚间', '新京报', '记者', '就此事', '致电', '湖北', '神丹', '健康', '食品', '有限', '公司', '方面', '其', '工作', '人员', '表示', '不知情', '需要', '了解', '清楚', '情况', '截至', '发稿', '暂', '未', '取得', '最新', '回应', '新京报', '记者', '还', '查询', '发现', '湖北', '神丹', '健康', '食品', '有限', '公司', '为', '农业', '产业化', '国家', '重点', '龙头', '企业', '高', '新', '技术', '企业', '此前', '曾', '因', '涉嫌', '虚假', '宣传', '中国', '最', '大', '的', '蛋品', '企业', '而', '被', '罚', '6万', '元']

进行英文分词：

Englisth=‘Trump was born and raised in the New York City borough of Queens and received an economics degree from the Wharton School. He was appointed president of his family's real estate business in 1971, renamed it The Trump Organization, and expanded it from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, including licensing his name for real estate and consumer products. He managed the company until his 2017 inauguration. He co-authored several books, including The Art of the Deal. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and he produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.’

六、nltk：

import nltk
import re
english='H:\\自然语言处理\\Experiment2\\English.txt'
with open(english,'r',encoding='utf-8') as file:
    u=file.read()
str=re.sub('[^\w ]','',u)
print(nltk.word_tokenize(str))
print(nltk.pos_tag(nltk.word_tokenize(str))) #对分完词的结果进行词性标注

结果：

['Trump', 'was', 'born', 'and', 'raised', 'in', 'the', 'New', 'York', 'City', 'borough', 'of', 'Queens', 'and', 'received', 'an', 'economics', 'degree', 'from', 'the', 'Wharton', 'School', 'He', 'was', 'appointed', 'president', 'of', 'his', 'familys', 'real', 'estate', 'business', 'in', '1971', 'renamed', 'it', 'The', 'Trump', 'Organization', 'and', 'expanded', 'it', 'from', 'Queens', 'and', 'Brooklyn', 'into', 'Manhattan', 'The', 'company', 'built', 'or', 'renovated', 'skyscrapers', 'hotels', 'casinos', 'and', 'golf', 'courses', 'Trump', 'later', 'started', 'various', 'side', 'ventures', 'including', 'licensing', 'his', 'name', 'for', 'real', 'estate', 'and', 'consumer', 'products', 'He', 'managed', 'the', 'company', 'until', 'his', '2017', 'inauguration', 'He', 'coauthored', 'several', 'books', 'including', 'The', 'Art', 'of', 'the', 'Deal', 'He', 'owned', 'the', 'Miss', 'Universe', 'and', 'Miss', 'USA', 'beauty', 'pageants', 'from', '1996', 'to', '2015', 'and', 'he', 'produced', 'and', 'hosted', 'The', 'Apprentice', 'a', 'reality', 'television', 'show', 'from', '2003', 'to', '2015', 'Forbes', 'estimates', 'his', 'net', 'worth', 'to', 'be', '31', 'billion']

七、spacy：

import spacy
nlp=spacy.load('en_core_web_sm')
document=nlp(str)
print(document.text.split())

结果：

八、StanfordcoreNLP：

nlp=StanfordCoreNLP(r'G:\\stanford-corenlp-full-2018-10-05\\stanford-corenlp-full-2018-10-05',lang='en')
print(nlp.word_tokenize(str))

结果;

以上就是八种分词工具的分词过程，我建议：中文分词使用 jieba进行分词，英文使用 NLTK进行分词。

重庆城市开发者社区

长江两岸老火锅，共聚山城开发者！We Want You！

更多推荐

【2023 CSDN年度报告】2023年属于哪种创作风格？一起来测！

重庆城市开发者社区

Claude 3 大模型再度点燃 AI 战火，性能和速度全面超越 GPT-4

重庆城市开发者社区

OpenAI承认ChatGPT变懒惰，正在修复该问题

重庆城市开发者社区

所有评论(0)

查看更多评论

木禾DING

@ding_programmer

已为社区贡献1条内容