References:

  1. https://github.com/maciejkula/glove-python
  2. https://blog.csdn.net/sinat_26917383/article/details/83029140
  3. https://blog.csdn.net/beilizhang/article/details/108175380

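Overview

This tutorial uses the glove_python package rather than Stanford's GloVe implementation, because the former can be called directly from Python and is easier to get started with.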

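Pitfall

glove_python only supports Python up to version 3.5; it does not work on later versions.
If your machine does not have Python 3.5, you can create a separate environment with Anaconda, for example `conda create -n glove35 python=3.5` (the environment name is arbitrary).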

Then run:
pip install libpython
pip install glove_python
If needed, install from the Tsinghua mirror:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple glove-python

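Preparing the data

This tutorial uses the training set of the small PTB (Penn Tree Bank) corpus, in which each line is one sentence. Each line is split on whitespace into a list of tokens: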

with open('data/ptb/ptb.train.txt', 'r') as f:
    lines = f.readlines()
    raw_dataset = [st.split() for st in lines]
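
To make the expected shape of `raw_dataset` concrete, here is a minimal sketch that applies the same whitespace split to a few in-memory lines instead of the PTB file. The sample sentences and the helper name `tokenize_lines` are made up for illustration. Unlike the snippet above, this version also skips blank lines.

```python
import io


def tokenize_lines(handle):
    """Split each non-blank line into whitespace-separated tokens."""
    return [line.split() for line in handle if line.strip()]


# Hypothetical stand-in for data/ptb/ptb.train.txt.
sample_file = io.StringIO("the chip maker said\n\nprofits rose sharply\n")
sample_dataset = tokenize_lines(sample_file)

# Each sentence becomes one list of tokens.
print(sample_dataset)
```

The result is a list of token lists, which is the input format passed to `Corpus.fit` in the next step.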

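Building the co-occurrence matrix

The `window` argument sets the size of the context window.

from glove import Corpus

# Construct a co-occurrence matrix from the corpus.
corpus_model = Corpus()
corpus_model.fit(raw_dataset, window=10)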
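
To illustrate what a window-based co-occurrence count looks like, the following is a minimal pure-Python sketch. It is not the library's implementation. It assumes that pairs are weighted by 1/distance, as in the GloVe paper, and that each unordered pair is stored once. The function name `cooccurrence_counts` and the toy sentences are hypothetical.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=10):
    """Return (word -> id dictionary, {(id_a, id_b): weighted count}).

    Assumption: a pair that is d tokens apart contributes 1/d,
    and each unordered pair is stored once with id_a < id_b.
    """
    dictionary = {}
    counts = defaultdict(float)
    for sentence in sentences:
        # Assign ids in order of first appearance.
        ids = [dictionary.setdefault(tok, len(dictionary)) for tok in sentence]
        for i, left in enumerate(ids):
            # Look at up to `window` tokens to the right of position i.
            for j in range(i + 1, min(i + 1 + window, len(ids))):
                right = ids[j]
                if left == right:
                    continue  # skip a word co-occurring with itself
                pair = (min(left, right), max(left, right))
                counts[pair] += 1.0 / (j - i)
    return dictionary, dict(counts)


toy_sentences = [["a", "b", "c"], ["a", "c"]]
toy_dictionary, toy_counts = cooccurrence_counts(toy_sentences, window=2)
print(toy_dictionary)
print(toy_counts)
```

A larger window captures more distant context at the cost of a denser matrix.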

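Building and training the model

When constructing the model you can choose the dimensionality of the word vectors (no_components) and the learning rate (learning_rate).
When training you can choose the number of epochs (epochs) and the number of threads (no_threads).

from glove import Glove

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus_model.matrix, epochs=10,
          no_threads=1, verbose=True)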
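
For context on what training minimises, here is a minimal sketch of the per-pair weighted least-squares loss from the GloVe paper. It is an illustration only, not the library's training code. The function names are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are the values suggested in the paper.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): down-weights rare pairs, caps at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2 for one word pair."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij, x_max, alpha) * error ** 2


# A pair whose dot product plus biases already equals log(X_ij) has zero loss.
print(pair_loss([1.0, 0.0], [0.0, 1.0], 0.0, math.log(5.0), 5.0))
```

Each epoch sweeps over the non-zero entries of the co-occurrence matrix and adjusts the vectors and biases to reduce this quantity.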

Finding similar words

Note that the dictionary must be added to the model first.

# Supply a word-id dictionary to allow similarity queries.
glove.add_dictionary(corpus_model.dictionary)

print(glove.most_similar('chip', number=10))
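
To show the idea behind a similarity query, here is a minimal sketch that ranks words by cosine similarity over a toy set of vectors. It assumes cosine similarity as the ranking measure and is not the library's code. The names `cosine`, `most_similar_sketch`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_sketch(word, vectors, dictionary, number=5):
    """Return up to `number` (word, similarity) pairs, most similar first."""
    query = vectors[dictionary[word]]
    scored = [
        (other, cosine(query, vectors[idx]))
        for other, idx in dictionary.items()
        if other != word  # exclude the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number]


# Toy data: two-dimensional vectors for three words.
sim_dictionary = {"chip": 0, "chips": 1, "bank": 2}
sim_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbours = most_similar_sketch("chip", sim_vectors, sim_dictionary, number=2)
print(neighbours)
```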

Accessing the vector of any word

print(glove.word_vectors[glove.dictionary['chip']])
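
Indexing the dictionary directly fails for a word that is not in the vocabulary. Below is a minimal sketch of a safer lookup, assuming the dictionary behaves like a plain Python dict mapping words to row indices. The helper name `get_vector` and the toy data are hypothetical.

```python
def get_vector(word, vectors, dictionary, default=None):
    """Return the vector for `word`, or `default` if the word is unknown."""
    idx = dictionary.get(word)
    return default if idx is None else vectors[idx]


# Toy stand-ins for glove.word_vectors and glove.dictionary.
lookup_dictionary = {"chip": 0, "bank": 1}
lookup_vectors = [[0.1, 0.2], [0.3, 0.4]]

print(get_vector("chip", lookup_vectors, lookup_dictionary))
print(get_vector("unseen", lookup_vectors, lookup_dictionary))
```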

Saving and loading

1) Saving and loading the GloVe model

glove.save('glove.model')
glove = Glove.load('glove.model')

2) Saving and loading the Corpus

corpus_model.save('corpus.model')
corpus_model = Corpus.load('corpus.model')

Full code

#coding:utf-8
from glove import Glove
from glove import Corpus

with open('data/ptb/ptb.train.txt', 'r') as f:
    lines = f.readlines()
    raw_dataset = [st.split() for st in lines]

# construct a cooccurrence matrix from a corpus
corpus_model = Corpus()
corpus_model.fit(raw_dataset, window=10)

#corpus_model.save('corpus.model')
print('Dict size: %s' % len(corpus_model.dictionary))
print('Collocations: %s' % corpus_model.matrix.nnz)

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus_model.matrix, epochs=10,
          no_threads=1, verbose=True)

# Supply a word-id dictionary to allow similarity queries.
glove.add_dictionary(corpus_model.dictionary)

print(glove.most_similar('chip', number=10))

print(glove.word_vectors[glove.dictionary['chip']])

# save and load
glove.save('glove.model')
glove = Glove.load('glove.model')

corpus_model.save('corpus.model')
corpus_model = Corpus.load('corpus.model')

