For usage of the original word2vec release, see this blog post: http://blog.csdn.net/jj12345jj198999/article/details/11069485
On Linux, installation and usage roughly amount to: download the source, run make, then train with the following command:
./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1  
// The output is written to vectors.bin; saving to a file makes the vectors reusable later, so they need not be recomputed every run.
Run the following command to start an interactive search for similar words (synonyms):
./distance vectors.bin  
To cluster the vocabulary instead, run:
./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500  
Sort the output by class id:
sort classes.txt -k 2 -n > classes.sorted.txt  
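Each line of classes.txt is a word followed by its cluster id, which is why sorting on the second field groups the clusters together. To pull the words of a single cluster back out, a minimal Python sketch (cluster id 42 is an arbitrary example):

# Group the words in classes.txt by cluster id.
clusters = {}
with open('classes.txt') as f:
    for line in f:
        word, cid = line.split()          # each line: "<word> <cluster id>"
        clusters.setdefault(int(cid), []).append(word)
print(clusters[42][:20])                  # first 20 words of cluster 42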

For the training set, each document usually needs to be collapsed into a single line of plain text, with punctuation and similar characters stripped out; a minimal preprocessing sketch follows.
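A sketch of this preprocessing (the file names and the exact punctuation-stripping rule are illustrative assumptions, not from the original post):

#encoding=utf-8
import re

def doc_to_line(path):
    # Read one document, drop punctuation, and collapse it to a single line.
    with open(path) as f:
        text = f.read().decode('utf-8')
    text = re.sub(ur'[^\w\s]', u' ', text, flags=re.UNICODE)  # strip punctuation
    return u' '.join(text.split())                            # collapse whitespace/newlines

with open('resultbig.txt', 'w') as out:
    for path in ['doc1.txt', 'doc2.txt']:  # hypothetical document list
        out.write((doc_to_line(path) + u'\n').encode('utf-8'))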

Here we use the Python interface to Google's word2vec: https://github.com/danielfrg/word2vec
(1) Installation:
$ pip install word2vec  # fails with the error below
Downloading/unpacking word2vec
Downloading word2vec-0.9.1.tar.gz (49kB): 49kB downloaded
Running setup.py (path:/tmp/pip_build_zhangwj/word2vec/setup.py) egg_info for package word2vec
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_zhangwj/word2vec/setup.py", line 23, in <module>
from Cython.Build import cythonize
ImportError: No module named Cython.Build
Cython must be installed first:
$ sudo pip install Cython  # without sudo, the install fails with a permissions error
Then install word2vec:
$ sudo pip install word2vec  # sudo is needed here as well

(2) Write a local test script:
Based on the word2vec.ipynb file under examples in the project:
#encoding=utf-8
import word2vec

# Train a binary model from the segmented corpus. Once the .bin file exists,
# this call can be skipped on later runs; just load the model directly below.
word2vec.word2vec("files/data_fenci.txt", "files/data_fenci.bin", size=100, verbose=True)

model = word2vec.load("files/data_fenci.bin")
indexes, metrics = model.cosine(u'感冒')  # find words similar to 感冒 (a cold)
print indexes, metrics
# print model.vocab[indexes]
for ele, similarity in model.generate_response(indexes, metrics).tolist():
    print ele, similarity
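The model object also supports analogy queries of the king - man + woman form. A sketch following the project's example notebook (the English words are placeholders; with the Chinese corpus above you would substitute terms from your own vocabulary):

indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
for ele, similarity in model.generate_response(indexes, metrics).tolist():
    print ele, similarity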

(3) Code analysis
word2vec.word2vec is the function being called; it is defined in scripts_interface.py with the following signature:
def word2vec(train, output, size=100, window=5, sample='1e-3', hs=0,
             negative=5, threads=12, iter_=5, min_count=5, alpha=0.025,
             debug=2, binary=1, cbow=1, save_vocab=None, read_vocab=None,
             verbose=False):
"""
word2vec execution

Parameters for training:
train <file> 训练数据
Use text data from <file> to train the model
output <file> 输出
Use <file> to save the resulting word vectors / word clusters
size <int> vector大小
Set size of word vectors; default is 100
window <int> 窗口
Set max skip length between words; default is 5
sample <float>
Set threshold for occurrence of words. Those that appear with
higher frequency in the training data will be randomly
down-sampled; default is 0 (off), useful value is 1e-5
hs <int> 层级softmax,默认使用
Use Hierarchical Softmax; default is 1 (0 = not used)
negative <int> 负采样,默认不使用
Number of negative examples; default is 0, common values are 5 - 10
(0 = not used)
threads <int>
Use <int> threads (default 1)
min_count <int>
This will discard words that appear less than <int> times; default
is 5
alpha <float>
Set the starting learning rate; default is 0.025
debug <int>
Set the debug mode (default = 2 = more info during training)
binary <int>
Save the resulting vectors in binary moded; default is 0 (off)
cbow <int>
Use the continuous back of words model; default is 1 (skip-gram
model)
save_vocab <file>
The vocabulary will be saved to <file>
read_vocab <file>
The vocabulary will be read from <file>, not constructed from the
training data
verbose
Print output from training
The file also contains word2clusters, word2phrase, doc2vec, and similar functions. Each of them concatenates the arguments passed in into a command string and then calls run_cmd to execute it; a simplified sketch of that pattern follows.
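A simplified sketch of the wrapper pattern (an illustration of the idea, not the library's actual code):

import subprocess

def run_cmd(cmd, verbose=False):
    # Launch the compiled C binary and optionally print its output.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if verbose:
        print(out)

# word2vec.word2vec(...) effectively assembles an argument list like this
# and hands it to run_cmd:
cmd = ['word2vec', '-train', 'files/data_fenci.txt',
       '-output', 'files/data_fenci.bin',
       '-size', '100', '-window', '5', '-cbow', '1', '-binary', '1']
run_cmd(cmd, verbose=True)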

word2vec also ships an experimental doc2vec; see: https://github.com/zhangweijiqn/word2vec/blob/master/examples/doc2vec.ipynb

Word2Vec in TensorFlow:
Generating word vectors (word embeddings) in TensorFlow.
The test code requires sklearn and matplotlib.
To install sklearn:
pip install -U scikit-learn  # fails: the scipy package is missing
See the SciPy site for installation instructions: http://www.scipy.org/install.html
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
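The TensorFlow word2vec tutorial visualizes the learned embeddings with sklearn's TSNE plus matplotlib; a minimal sketch of that step (the names final_embeddings and reverse_dictionary follow the tutorial's conventions and are assumed here):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(final_embeddings, reverse_dictionary, num_points=200):
    # Project the high-dimensional embeddings down to 2D for plotting.
    tsne = TSNE(n_components=2, init='pca', n_iter=5000)
    low_dim = tsne.fit_transform(final_embeddings[:num_points])
    plt.figure(figsize=(18, 18))
    for i in range(num_points):
        x, y = low_dim[i]
        plt.scatter(x, y)
        plt.annotate(reverse_dictionary[i], xy=(x, y), xytext=(5, 2),
                     textcoords='offset points')
    plt.savefig('tsne.png')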



