Tokenization with NLTK

Rhichard_CHAN · Published 2020-10-09 13:02:42

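Preface

This post records the main errors I ran into while writing an NLTK tokenization project in Python, and how I fixed them.
The project uses NLTK to read text from a database, remove stop words from it, and write the processed result back to the database.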

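I. What is NLTK?

NLTK (Natural Language Toolkit) is one of the most widely used Python libraries in the NLP field.
It is an open-source project comprising Python modules, datasets, and tutorials for NLP research and development [1].
NLTK was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
It includes graphical demonstrations and sample data, and its tutorials explain the basic concepts behind the language processing tasks the toolkit supports.

In this post it is mainly used to remove stop words from text.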
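
As a rough illustration of what stop-word removal does, here is a minimal sketch in plain Python. The tiny STOP_WORDS set, the regex tokenizer, and the function name remove_stop_words are stand-ins assumed for illustration only; the project itself uses NLTK's English stop-word list and word_tokenize.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# NLTK's stopwords.words('english') is much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "this"}


def remove_stop_words(text):
    """Split text into lowercase word tokens and drop the stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words("This is an example of the filtering step."))
```

Lowercasing before the lookup matters because the stop-word set holds lowercase entries, so "This" would otherwise slip through.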

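II. Implementation

The code mainly uses the nltk and pandas packages, which can be installed with the commands below (pymysql, used for the database connection, installs the same way):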

pip install nltk
pip install pandas

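import pymysql
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The NLTK data must be downloaded once beforehand, e.g.
# nltk.download('stopwords') and nltk.download('punkt').

# Connect to the local MySQL database `nce`.
con = pymysql.connect(
    host='localhost',
    port=3306,
    user='root',
    passwd='123',
    db='nce',
    charset='utf8',
)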
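def insert(con, frequent, l):
    """Store the filtered token list in the `frequent` column of the row with a_id = l."""
    cue = con.cursor()
    try:
        print(str(frequent))
        print(l)
        # str(frequent) turns the token list into one string so that it
        # binds to a single column.
        cue.execute(
            "update article set frequent=(%s) where a_id=(%s)", [str(frequent), l])
        print("insert success")
    except Exception as e:
        print('Insert error:', e)
        con.rollback()
    else:
        con.commit()
    finally:
        # Close the cursor whether the update succeeded or failed.
        cue.close()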

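def read():
    """Read every article, remove English stop words, and write the result back."""
    cue = con.cursor()
    query = """select text
    from article
    """
    stop_words = set(stopwords.words('english'))
    cue.execute(query)
    result = cue.fetchall()
    cue.close()
    df_resulet = pd.DataFrame(list(result))
    for l in df_resulet.index:
        # .values is a NumPy array; str() renders it as "['...']",
        # and [1:-1] strips the square brackets.
        text = str(df_resulet.loc[l].values)
        word_tokens = word_tokenize(text[1:-1])
        # NLTK's stop-word list is lowercase, so compare case-insensitively.
        filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
        # [1:-1] drops the first and last tokens, intended to remove the quote
        # characters that str() left around the text.
        # l + 1 assumes a_id runs 1, 2, 3, ... in the order the rows were returned.
        insert(con, filtered_sentence[1:-1], l + 1)


read()
con.close()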
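
The loop above depends on the row position (l+1) matching a_id and on slicing the str() of an array. One possible alternative, sketched here under assumptions, selects a_id together with text and hands the plain string to the processing step. To stay runnable without a MySQL server, the sketch uses Python's built-in sqlite3 as a stand-in (it uses ? placeholders rather than pymysql's %s). The function process_articles, the filter_tokens parameter, and the sample rows are all hypothetical, and storing the tokens space-separated is a choice made for this sketch, not what the original code does.

```python
import sqlite3


def process_articles(con, filter_tokens):
    """Read every article, filter its text, and store the result by a_id."""
    cur = con.cursor()
    cur.execute("select a_id, text from article")
    for a_id, text in cur.fetchall():
        tokens = filter_tokens(text)
        # Store the tokens as one space-separated string (a single column value).
        cur.execute(
            "update article set frequent = ? where a_id = ?",
            (" ".join(tokens), a_id),
        )
    con.commit()
    cur.close()


# Demo with an in-memory database standing in for the real `nce` schema.
con = sqlite3.connect(":memory:")
con.execute("create table article (a_id integer primary key, text text, frequent text)")
con.executemany(
    "insert into article (a_id, text) values (?, ?)",
    [(7, "the quick fox"), (9, "a lazy dog")],
)

# Placeholder filter that drops two stop words; swap in the NLTK-based filter.
process_articles(con, lambda text: [w for w in text.split() if w not in {"the", "a"}])
rows = con.execute("select a_id, frequent from article order by a_id").fetchall()
print(rows)
```

Because each update is keyed on the a_id that came back with the row, gaps or reordering in the id sequence no longer matter.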

III. Errors encountered along the way

1. An array cannot be used directly where a string is expected

df_resulet = pd.DataFrame(list(result))
for l in df_resulet.index:
    text = df_resulet.loc[l].values

The error is:

TypeError: cannot use a string pattern on a bytes-like object

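Fix: .values returns a NumPy array, while word_tokenize applies string regular expressions and therefore needs a str. Casting the value to a string is enough:

text = str(df_resulet.loc[l].values)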
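
An alternative to casting the whole array is to pull the string itself out of the row, which avoids the extra brackets and quotes that str() adds. The sketch below uses a hand-written tuple in the shape that cursor.fetchall() returns for a one-column query (one tuple per row); the sample sentences are made up.

```python
# Shape of cursor.fetchall() for "select text from article": one tuple per row.
result = (("NLTK is a toolkit.",), ("It ships with corpora.",))

for row in result:
    as_cast = str(row)   # repr of the whole row, punctuation included
    as_value = row[0]    # the column value itself, already a str
    print(repr(as_cast), "vs", repr(as_value))

# Collect the plain strings, ready to be passed to a tokenizer.
texts = [row[0] for row in result]
```

With the DataFrame built in the original code, the equivalent lookup would be df_resulet.loc[l, 0], which returns the cell value rather than an array.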

2. Insert error

The code was as follows (example):

def insert(con,frequent,l):
    cue = con.cursor()
    # print("mysql conneted")
    try:
        # print(frequent)
        cue.execute(
           "insert into article (frequent) values(%s)",[frequent])
        print("insert success")

    except Exception as e:
        print('Insert error:', e)
        con.rollback()
    else:
        con.commit()

Insert error: (1241, 'Operand should contain 1 column(s)')

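The error says that the operand should contain exactly one column; in other words, the value I was inserting was being treated as more than one column.

Solution:

The value passed in from read() is filtered_sentence, which is a Python list of tokens, not a string. When a list is passed as a parameter, pymysql expands it into a parenthesized, comma-separated group of values, so the statement ends up with as many operands as there are tokens, and MySQL rejects it. Casting the list to a single string with str() before passing it to the statement fixes this.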

The modified code:

cue.execute(
    "insert into article (frequent) values(%s)", [str(frequent)])

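The underlying rule is that each placeholder should receive exactly one scalar value, so a token list has to be serialized first. Here is a minimal sketch, using sqlite3 as a stand-in so that it runs without a server (its placeholder is ? and its error for a list parameter differs from MySQL's 1241). Using json.dumps is one possible serialization assumed for this sketch; the original code uses str().

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table article (a_id integer primary key, frequent text)")

tokens = ["quick", "brown", "fox"]

# Binding the list itself fails: a placeholder takes one scalar value.
try:
    con.execute("insert into article (frequent) values (?)", [tokens])
    list_accepted = True
except sqlite3.Error as e:
    list_accepted = False
    print("Insert error:", e)

# Serializing to a single string first works, and json.loads can restore the list.
con.execute("insert into article (frequent) values (?)", [json.dumps(tokens)])
con.commit()

stored = con.execute("select frequent from article").fetchone()[0]
restored = json.loads(stored)
```

A JSON string has the advantage over str(list) that it can be parsed back into a list reliably when the column is read later.
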
3. SQL update error

    try:
        cue.execute(
            "update article set frequent=(%s) where a_id=(%s)",[str(frequent),l])
        print("insert success")

Lock wait timeout exceeded; try restarting transaction

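Cause:
The UPDATE had to wait for a lock held by another transaction, and every attempt failed after waiting 50 seconds (InnoDB's default innodb_lock_wait_timeout). The fix is straightforward.

1. Check the threads currently running on the database: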

SHOW FULL PROCESSLIST

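Look for any thread that has been running for an unusually long time, then check InnoDB's transaction table, INNODB_TRX, for transactions that are currently holding locks. If a transaction's thread ID (trx_mysql_thread_id) matches one of the threads shown as Sleep in SHOW FULL PROCESSLIST, that sleeping session has an open transaction that was never committed or rolled back and is stuck. Kill it directly with KILL followed by the thread ID.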
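
The same check can be scripted from Python. The following is only a sketch under assumptions: find_open_transactions and kill_thread are hypothetical helper names, the 50-second threshold simply mirrors the timeout mentioned above, and the cursor is assumed to be a pymysql-style DB-API cursor (with %s placeholders) opened by an account that is allowed to read information_schema and run KILL.

```python
def find_open_transactions(cursor, min_age_seconds=50):
    """Return (thread_id, started, query) rows for InnoDB transactions older than the threshold."""
    cursor.execute(
        "SELECT trx_mysql_thread_id, trx_started, trx_query "
        "FROM information_schema.INNODB_TRX "
        "WHERE trx_started < NOW() - INTERVAL %s SECOND "
        "ORDER BY trx_started",
        (min_age_seconds,),
    )
    return list(cursor.fetchall())


def kill_thread(cursor, thread_id):
    """Terminate one MySQL session by its thread ID."""
    # Coerce to int so that only a numeric ID ever reaches the statement.
    cursor.execute("KILL %d" % int(thread_id))
```

A cautious way to use it is to print the rows returned by find_open_transactions first, compare the thread IDs against SHOW FULL PROCESSLIST, and only then call kill_thread on the session that is confirmed to be stuck.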