For usage of the original word2vec release, see this blog post: http://blog.csdn.net/jj12345jj198999/article/details/11069485
On Linux, installation and usage roughly amount to: download the source, run make, then train with the following command:
./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1  
// The output is written to vectors.bin; saving to a file makes the vectors reusable later, so they need not be recomputed every run.
Run the following command to start an interactive search for similar words (synonyms):
./distance vectors.bin  
To cluster the vocabulary instead, run:
./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500  
Sort the output by class id:
sort classes.txt -k 2 -n > classes.sorted.txt  
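Each line of classes.txt is a word followed by its cluster id, which is why sorting on the second field groups the clusters together. To pull the words of a single cluster back out, a minimal Python sketch (cluster id 42 is an arbitrary example):

# Group the words in classes.txt by cluster id.
clusters = {}
with open('classes.txt') as f:
    for line in f:
        word, cid = line.split()          # each line: "<word> <cluster id>"
        clusters.setdefault(int(cid), []).append(word)
print(clusters[42][:20])                  # first 20 words of cluster 42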

For the training set, each document usually needs to be collapsed into a single line of plain text, with punctuation and similar characters stripped out; a minimal preprocessing sketch follows.
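A sketch of this preprocessing (the file names and the exact punctuation-stripping rule are illustrative assumptions, not from the original post):

#encoding=utf-8
import re

def doc_to_line(path):
    # Read one document, drop punctuation, and collapse it to a single line.
    with open(path) as f:
        text = f.read().decode('utf-8')
    text = re.sub(ur'[^\w\s]', u' ', text, flags=re.UNICODE)  # strip punctuation
    return u' '.join(text.split())                            # collapse whitespace/newlines

with open('resultbig.txt', 'w') as out:
    for path in ['doc1.txt', 'doc2.txt']:  # hypothetical document list
        out.write((doc_to_line(path) + u'\n').encode('utf-8'))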

Here we use the Python interface to Google's word2vec: https://github.com/danielfrg/word2vec
(1) Installation:
$ pip install word2vec  # fails with the error below
Downloading/unpacking word2vec
Downloading word2vec-0.9.1.tar.gz (49kB): 49kB downloaded
Running setup.py (path:/tmp/pip_build_zhangwj/word2vec/setup.py) egg_info for package word2vec
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_zhangwj/word2vec/setup.py", line 23, in <module>
from Cython.Build import cythonize
ImportError: No module named Cython.Build
Cython must be installed first:
$ sudo pip install Cython  # without sudo, the install fails with a permissions error
Then install word2vec:
$ sudo pip install word2vec  # sudo is needed here as well

(2) Write a local test script:
Based on the word2vec.ipynb file under examples in the project:
#encoding=utf-8
import word2vec

# Train a binary model from the segmented corpus. Once the .bin file exists,
# this call can be skipped on later runs; just load the model directly below.
word2vec.word2vec("files/data_fenci.txt", "files/data_fenci.bin", size=100, verbose=True)

model = word2vec.load("files/data_fenci.bin")
indexes, metrics = model.cosine(u'感冒')  # find words similar to 感冒 (a cold)
print indexes, metrics
# print model.vocab[indexes]
for ele, similarity in model.generate_response(indexes, metrics).tolist():
    print ele, similarity
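The model object also supports analogy queries of the king - man + woman form. A sketch following the project's example notebook (the English words are placeholders; with the Chinese corpus above you would substitute terms from your own vocabulary):

indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
for ele, similarity in model.generate_response(indexes, metrics).tolist():
    print ele, similarity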

(3) Code analysis
word2vec.word2vec is the function being called; it is defined in scripts_interface.py with the following signature:
def word2vec(train, output, size=100, window=5, sample='1e-3', hs=0,
             negative=5, threads=12, iter_=5, min_count=5, alpha=0.025,
             debug=2, binary=1, cbow=1, save_vocab=None, read_vocab=None,
             verbose=False):
"""
word2vec execution

Parameters for training:
train <file> 训练数据
Use text data from <file> to train the model
output <file> 输出
Use <file> to save the resulting word vectors / word clusters
size <int> vector大小
Set size of word vectors; default is 100
window <int> 窗口
Set max skip length between words; default is 5
sample <float>
Set threshold for occurrence of words. Those that appear with
higher frequency in the training data will be randomly
down-sampled; default is 0 (off), useful value is 1e-5
hs <int> 层级softmax,默认使用
Use Hierarchical Softmax; default is 1 (0 = not used)
negative <int> 负采样,默认不使用
Number of negative examples; default is 0, common values are 5 - 10
(0 = not used)
threads <int>
Use <int> threads (default 1)
min_count <int>
This will discard words that appear less than <int> times; default
is 5
alpha <float>
Set the starting learning rate; default is 0.025
debug <int>
Set the debug mode (default = 2 = more info during training)
binary <int>
Save the resulting vectors in binary moded; default is 0 (off)
cbow <int>
Use the continuous back of words model; default is 1 (skip-gram
model)
save_vocab <file>
The vocabulary will be saved to <file>
read_vocab <file>
The vocabulary will be read from <file>, not constructed from the
training data
verbose
Print output from training
The file also contains word2clusters, word2phrase, doc2vec, and similar functions. Each of them concatenates the arguments passed in into a command string and then calls run_cmd to execute it; a simplified sketch of that pattern follows.
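A simplified sketch of the wrapper pattern (an illustration of the idea, not the library's actual code):

import subprocess

def run_cmd(cmd, verbose=False):
    # Launch the compiled C binary and optionally print its output.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if verbose:
        print(out)

# word2vec.word2vec(...) effectively assembles an argument list like this
# and hands it to run_cmd:
cmd = ['word2vec', '-train', 'files/data_fenci.txt',
       '-output', 'files/data_fenci.bin',
       '-size', '100', '-window', '5', '-cbow', '1', '-binary', '1']
run_cmd(cmd, verbose=True)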

word2vec also ships an experimental doc2vec; see: https://github.com/zhangweijiqn/word2vec/blob/master/examples/doc2vec.ipynb

Word2Vec in TensorFlow:
Generating word vectors (word embeddings) in TensorFlow.
The test code requires sklearn and matplotlib.
To install sklearn:
pip install -U scikit-learn  # fails: the scipy package is missing
See the SciPy site for installation instructions: http://www.scipy.org/install.html
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
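The TensorFlow word2vec tutorial visualizes the learned embeddings with sklearn's TSNE plus matplotlib; a minimal sketch of that step (the names final_embeddings and reverse_dictionary follow the tutorial's conventions and are assumed here):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(final_embeddings, reverse_dictionary, num_points=200):
    # Project the high-dimensional embeddings down to 2D for plotting.
    tsne = TSNE(n_components=2, init='pca', n_iter=5000)
    low_dim = tsne.fit_transform(final_embeddings[:num_points])
    plt.figure(figsize=(18, 18))
    for i in range(num_points):
        x, y = low_dim[i]
        plt.scatter(x, y)
        plt.annotate(reverse_dictionary[i], xy=(x, y), xytext=(5, 2),
                     textcoords='offset points')
    plt.savefig('tsne.png')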



