Mofan NLP: Word Vectors (CBOW)


卢容和 · published 2021-01-07 19:45:26

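Since this is not my first time with the topic, this post only excerpts Mofan's (莫烦) points about word vectors and pays more attention to the code.
It has been a year and a half since I last worked through Mofan's tutorials systematically. Time flies.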

Reposted from: https://mofanpy.com/tutorials/machine-learning/nlp/intro-w2v/

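Natural language processing

Learning a language usually starts with learning what words mean; combining words into sentences is what gives language its meaning.
A computer recognizes images by looking for patterns in the numbers behind them. So how can language be expressed with numbers?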

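Answer: a computer can pick up the sentiment between the lines, and understand and process text, not because it understands our all-encompassing human language, but because it places language, or individual words, at the right position. A computer's understanding of a word is really an understanding of space and position.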

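Anything that can be turned into numbers and projected into some space can be grouped by similarity.
Similarity measure: if you only want to measure how similar two words are, the angle between their vectors is enough.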
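The angle-based measure mentioned above is usually computed as cosine similarity. Here is a minimal sketch in plain Python; the three 2-D vectors are made up for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = perpendicular)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-D word vectors, for illustration only.
vec_cat = [1.0, 2.0]
vec_dog = [2.0, 4.0]    # same direction as vec_cat, only longer
vec_car = [2.0, -1.0]   # perpendicular to vec_cat

print(cosine_similarity(vec_cat, vec_dog))  # close to 1: same angle despite different length
print(cosine_similarity(vec_cat, vec_car))  # close to 0: unrelated directions
```

Because the vector length is divided out, only the direction matters, which is exactly why angle information alone is enough for comparing two words.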

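Also: words that occur very frequently and combine with many other words are what the machine treats as "neutral" words. The more discriminative a word is, the farther it may sit from the central region, because it resembles no other word; the more generic a word is, appearing in every kind of context, the closer it may sit to the origin. In that case, the distances between points tell us something about word frequency.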

Word vector techniques

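Steps:
During training we take a short span of text and look up the vectors of its words, for example the vectors of every word except "一". We combine them into a single vector representing the span as a whole, and use that vector to predict the middle word "一". Then we move on to the next span.

Shift the window by one position and use the surrounding words to predict "段", then keep sweeping the window over all of the text, always using the context to predict the middle word. This way the computer works out how words relate to their contexts: ~~the closer words are to each other, the closer their relationship~~ words that consistently appear in similar contexts are more closely related, and to some extent their vectors are closer too. Instead of predicting the middle word from its context, we can also turn the idea around and predict the context from the middle word.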
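The window sweep described above can be sketched in a few lines of plain Python. The function name `window_pairs` and the sample sentence are hypothetical. Unlike the `process_w2v_data` function later in this post, which keeps only full windows for CBOW, this sketch also keeps the truncated windows at the edges; that is a simplification of mine.

```python
def window_pairs(tokens, window=2):
    """Slide a window over tokens and build CBOW and skip-gram training pairs."""
    cbow, skip_gram = [], []
    for i, center in enumerate(tokens):
        # context = up to `window` tokens on each side of the center word
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, center))                  # context -> center word
        skip_gram.extend((center, c) for c in context)  # center word -> each context word
    return cbow, skip_gram

tokens = "the quick brown fox jumps".split()
cbow, skip_gram = window_pairs(tokens, window=1)
print(cbow[2])        # the pair built around "brown"
print(skip_gram[:3])  # the first few (center, context) pairs
```

Both lists come from the same sweep; the only difference is which side of the pair is treated as the input and which as the prediction target.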

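Principle:
The assumption is that the words surrounding a given word are all related to it. So when we predict related words, we also pull those related words closer together, gather similar words in one place, and end up with vectors for all words.

Use:
Word vectors are a kind of pretrained feature. The feature representation of words is trained in advance with the word2vec method, and the pretrained result is then used directly in other settings.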

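Continuous bag of words (CBOW)

This is a handwritten implementation (not a long one). Earlier, in Hung-yi Lee's homework 4 (sentiment classification), I used the gensim library for word2vec. With limited study time there is no real need to implement it by hand, and this example includes no preprocessing or optimization. It uses TensorFlow 2.x, which I have not used before, so I will simply read through and appreciate it.

Pick a word to predict, then learn the relationship between that word and the words in its context.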

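Why summing is questionable: a sentence is made of words, so one way to understand a sentence is to add up the word vectors of all its words and call the result the meaning of the sentence. Intuitively, though, this kind of vector addition does not hold up well: the sum is still just a point in the word space. Calling it a sentence vector does not seem right, and calling it the meaning of a single word does not seem right either.

So a more common approach is to treat the trained word vectors as a pretrained model, feed them into another neural network (an RNN, for example), and let that network process them to learn sentence vectors.

From Mofan's video: to aggregate the context, the word vectors of the context words are added together to form a context vector. Note that this addition stays inside the word-vector space; the context is not represented in a separate context-vector space. There is another training method, Skip-Gram, which I personally prefer, because it does not involve the word-vector addition that I find hard to justify.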
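To make the point about staying in the word-vector space concrete, here is a small sketch in plain Python. The embedding table and its values are invented for illustration. The `CBOW` class later in this post averages with `tf.reduce_mean` rather than summing; both variants are included here.

```python
# Hypothetical 2-D embedding table; the values are made up for illustration.
embeddings = {
    "i":     [0.2, 0.1],
    "like":  [0.4, 0.3],
    "green": [0.9, 0.8],
    "tea":   [1.0, 0.6],
}

def context_vector(context_words, table, reduce="mean"):
    """Combine the context word vectors by element-wise sum or mean."""
    vectors = [table[w] for w in context_words]
    summed = [sum(dims) for dims in zip(*vectors)]  # add up each dimension
    if reduce == "sum":
        return summed
    return [s / len(vectors) for s in summed]

# Context of the center word "green" in "i like green tea".
ctx = context_vector(["i", "like", "tea"], embeddings)
print(ctx, len(ctx))
```

The result has the same number of dimensions as a single word vector, so it is just another point in the same space, which is the objection raised above.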

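Loss: tf.nn.nce_loss

Noise Contrastive Estimation (NCE)
nce_loss greatly speeds up a loss that would otherwise be computed with softmax. It does not compute the loss over the whole vocabulary; it samples a few words and propagates the loss through those only, because considering every word becomes very slow when the vocabulary is large.
If the cost of a softmax over a large vocabulary is not a concern, use softmax cross-entropy instead.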
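For comparison, this is roughly what the full softmax cross-entropy alternative computes for one example. It is a plain-Python sketch with made-up scores, not the TensorFlow implementation; the point is that every class contributes to the normalizer, so the cost grows with the vocabulary size.

```python
import math

def softmax_cross_entropy(logits, true_class):
    """Cross-entropy of a softmax over ALL classes for a single example."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

# Hypothetical scores, one per vocabulary word.
logits = [2.0, 0.5, -1.0, 0.0]
loss = softmax_cross_entropy(logits, true_class=0)
print(loss)
```

The sum inside `log_sum_exp` runs over the whole vocabulary on every step, which is the part that NCE avoids by looking at a handful of sampled classes.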

How tf.nn.embedding works

How nce_loss works
tf.nn.nce_loss reference link
The first sentence below is the key point.

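Suppose the input data fed to nce_loss is K-dimensional and there are N classes in total.
Then:
weights.shape = (N, K)
biases.shape = (N,)
inputs.shape = (batch_size, K)
labels.shape = (batch_size, num_true)
num_true: the number of true (positive) classes per example
num_sampled: how many negative classes are sampled
num_classes = N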

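The logic of nce_loss is as follows:
_compute_sampled_logits: computes the outputs and labels for the true classes and for the sampled negative classes.

sigmoid_cross_entropy_with_logits: computes the loss between those outputs and labels with sigmoid cross-entropy, which is then backpropagated. This turns the problem into num_sampled + num_true binary classification problems, each using the cross-entropy loss, that is, the loss commonly used for logistic regression. TF also provides softmax_cross_entropy_with_logits, which is a different function. (I understand this part myself; I am not sure whether the reader does.)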
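The two steps above can be illustrated with a simplified sketch for a single example. It is not the TensorFlow implementation: the function names are my own, negatives are drawn uniformly instead of log-uniformly, and there is no correction of the logits for the sampling probability, which the real nce_loss applies as far as I understand.

```python
import math
import random

def sigmoid_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def nce_loss_sketch(weights, biases, inputs, true_class, num_sampled, rng):
    """Loss for ONE example: one true class plus num_sampled sampled negatives.

    weights: N rows of K floats, biases: N floats, inputs: K floats.
    Simplified: uniform sampling, no sampling-probability correction.
    """
    num_classes = len(weights)
    negatives = [rng.randrange(num_classes) for _ in range(num_sampled)]

    def logit(c):
        # one row of the output layer: inputs . weights[c] + biases[c]
        return sum(w * x for w, x in zip(weights[c], inputs)) + biases[c]

    loss = sigmoid_cross_entropy(logit(true_class), 1.0)                   # true class, label 1
    loss += sum(sigmoid_cross_entropy(logit(c), 0.0) for c in negatives)   # negatives, label 0
    return loss

rng = random.Random(0)
N, K = 6, 2  # hypothetical sizes: 6 classes, 2-D inputs
weights = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(N)]
biases = [0.0] * N
inputs = [0.5, -0.2]
loss = nce_loss_sketch(weights, biases, inputs, true_class=3, num_sampled=4, rng=rng)
print(loss)
```

Only 1 + num_sampled rows of the weight matrix are touched per example, no matter how large N is, which is where the speed-up over a full softmax comes from.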

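How does log_uniform_candidate_sampler sample?
The larger the class id k, the lower the probability that it is sampled.

In TF's word2vec implementation, the more frequent a word is, the smaller its class id. So in TF's word2vec, negative sampling in effect prefers high-frequency words as negative samples.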
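The TensorFlow documentation for the log-uniform candidate sampler gives the sampling probability as P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1). The sketch below just evaluates that formula in plain Python; the vocabulary size of 30 is an arbitrary value chosen for illustration.

```python
import math

def log_uniform_prob(k, range_max):
    """Probability of class id k under the log-uniform (Zipfian) distribution on [0, range_max)."""
    return (math.log(k + 2) - math.log(k + 1)) / math.log(range_max + 1)

range_max = 30  # hypothetical vocabulary size
probs = [log_uniform_prob(k, range_max) for k in range(range_max)]

for k in (0, 1, 10, 29):
    print(k, round(probs[k], 4))  # probabilities shrink as the id grows
```

The probabilities fall as k grows and add up to 1 over the whole range, so the sampler only favors frequent words if ids were assigned in order of decreasing frequency, which is what the vocabulary sorting in `process_w2v_data` takes care of.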

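There is also an article that explains the parameters.

Training on the MNIST dataset with nce_loss makes it easier to see what the nce_loss layer actually does: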

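import math
import tensorflow as tf  # TF 1.x API, as used in the MNIST example

# vocabulary_size, embedding_size, num_sampled, fc1_drop and y_idx come from the surrounding model.
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)), name="embed")
# Assumed definition: the bias variable was not shown in the original snippet.
nce_biases = tf.Variable(tf.zeros([vocabulary_size]), name="bias")

# Training: loss over the true class and the sampled negative classes only.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=y_idx,
                   inputs=fc1_drop,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
# Prediction: full logits over all classes.
output = tf.matmul(fc1_drop, tf.transpose(nce_weights)) + nce_biases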

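The final tf.matmul shows what tf.nn.nce_loss stands in for: during training, nce_loss evaluates the same linear layer (inputs times the transposed weights, plus the biases), but only for the true class and the sampled negative classes, while the full tf.matmul over all classes produces the output at prediction time.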

Example: find the undercover digit '9'

# -*- coding: utf-8 -*-
"""
Created on Thu Jan  7 15:44:39 2021

@author: Administrator
"""
import matplotlib.pyplot as plt
import numpy as np
import itertools

class Dataset:
    def __init__(self, x, y, v2i, i2v):
        self.x, self.y = x, y
        self.v2i, self.i2v = v2i, i2v
        self.vocab = v2i.keys()

    def sample(self, n):
        b_idx = np.random.randint(0, len(self.x), n)
        bx, by = self.x[b_idx], self.y[b_idx]
        return bx, by

    @property
    def num_word(self):
        return len(self.v2i)

def show_w2v_word_embedding(model, data: Dataset, path):
    word_emb = model.embeddings.get_weights()[0]
    for i in range(data.num_word):
        c = "blue"
        try:
            int(data.i2v[i])
        except ValueError:
            c = "red"
        plt.text(word_emb[i, 0], word_emb[i, 1], s=data.i2v[i], color=c, weight="bold")
    plt.xlim(word_emb[:, 0].min() - .5, word_emb[:, 0].max() + .5)
    plt.ylim(word_emb[:, 1].min() - .5, word_emb[:, 1].max() + .5)
    plt.xticks(())
    plt.yticks(())
    plt.xlabel("embedding dim1")
    plt.ylabel("embedding dim2")
    plt.savefig(path, dpi=300, format="png")
    plt.show()

def process_w2v_data(corpus, skip_window=2, method="skip_gram"):
    all_words = [sentence.split(" ") for sentence in corpus]
    all_words = np.array(list(itertools.chain(*all_words)))
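    # Sort the vocabulary by decreasing frequency: the negative sampler behind nce_loss
    # (log-uniform) assumes that smaller ids belong to more frequent words.
    # np.unique(..., return_counts=True) also returns how many times each unique item appears.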
    vocab, v_count = np.unique(all_words, return_counts=True)
    vocab = vocab[np.argsort(v_count)[::-1]]

    print("all vocabularies sorted from more frequent to less frequent:\n", vocab)
    v2i = {v: i for i, v in enumerate(vocab)}
    i2v = {i: v for v, i in v2i.items()}

    # pair data
    pairs = []
    js = [i for i in range(-skip_window, skip_window + 1) if i != 0]

    for c in corpus:
        words = c.split(" ")
        w_idx = [v2i[w] for w in words]
        if method.lower() == "skip_gram":
            for i in range(len(w_idx)):
                for j in js:
                    if i + j < 0 or i + j >= len(w_idx):
                        continue
                    pairs.append((w_idx[i], w_idx[i + j]))  # (center, context) or (feature, target)
        elif method.lower() == "cbow":
            for i in range(skip_window, len(w_idx) - skip_window):
                context = []
                for j in js:
                    context.append(w_idx[i + j])
                pairs.append(context + [w_idx[i]])  # (contexts, center) or (feature, target)
        else:
            raise ValueError
    pairs = np.array(pairs)
    print("5 example pairs:\n", pairs[:5])
    if method.lower() == "skip_gram":
        x, y = pairs[:, 0], pairs[:, 1]
    elif method.lower() == "cbow":
        x, y = pairs[:, :-1], pairs[:, -1]
    else:
        raise ValueError
    return Dataset(x, y, v2i, i2v)

from tensorflow import keras
import tensorflow as tf

corpus = [
    # numbers
    "5 2 4 8 6 2 3 6 4",
    "4 8 5 6 9 5 5 6",
    "1 1 5 2 3 3 8",
    "3 6 9 6 8 7 4 6 3",
    "8 9 9 6 1 4 3 4",
    "1 0 2 0 2 1 3 3 3 3 3",
    "9 3 3 0 1 4 7 8",
    "9 9 8 5 6 7 1 2 3 0 1 0",

    # letters; we expect 9 to end up close to the letters
    "a t g q e h 9 u f",
    "e q y u o i p s",
    "q o 9 p l k j o k k o p",
    "h g y i u t t a e q",
    "i k d q r e 9 e a d",
    "o p d g 9 s a f g a",
    "i u y g h k l a s w",
    "o l u y a o g f s",
    "o p i u y g d a s j d l",
    "u k i l o 9 l j s",
    "y g i s h k j l f r f",
    "i o h n 9 9 d 9 f a 9",
]


class CBOW(keras.Model):
    def __init__(self, v_dim, emb_dim):  # emb_dim: embedding (feature) size, 2 in this example
        super().__init__()
        self.v_dim = v_dim
        self.embeddings = keras.layers.Embedding(
            input_dim=v_dim, output_dim=emb_dim,  # [n_vocab, emb_dim]
            embeddings_initializer=keras.initializers.RandomNormal(0., 0.1),
        )

        # noise-contrastive estimation
        self.nce_w = self.add_weight(
            name="nce_w", shape=[v_dim, emb_dim],
            initializer=keras.initializers.TruncatedNormal(0., 0.1))  # [n_vocab, emb_dim]
        self.nce_b = self.add_weight(
            name="nce_b", shape=(v_dim,),
            initializer=keras.initializers.Constant(0.1))  # [n_vocab, ]

        self.opt = keras.optimizers.Adam(0.01)

    def call(self, x, training=None, mask=None):
        # x.shape = [n, skip_window*2]
        o = self.embeddings(x)          # [n, skip_window*2, emb_dim]
        o = tf.reduce_mean(o, axis=1)   # [n, emb_dim]
        return o

    # negative sampling: take one positive label and num_sampled negative labels to compute the loss
    # in order to reduce the computation of full softmax
    def loss(self, x, y, training=None):
        embedded = self.call(x, training) # [batch,emb_dim]
        return tf.reduce_mean(
            tf.nn.nce_loss(
                weights=self.nce_w, biases=self.nce_b, labels=tf.expand_dims(y, axis=1), #labels=[batch,1]
                inputs=embedded, num_sampled=5, num_classes=self.v_dim))

    def step(self, x, y):
        with tf.GradientTape() as tape:
            loss = self.loss(x, y, True)
        # compute gradients outside the tape context so the gradient ops are not recorded
        grads = tape.gradient(loss, self.trainable_variables)
        self.opt.apply_gradients(zip(grads, self.trainable_variables))
        return loss.numpy()


def train(model, data):
    for t in range(2500):
        bx, by = data.sample(8)
        loss = model.step(bx, by)
        if t % 200 == 0:
            print("step: {} | loss: {}".format(t, loss))


if __name__ == "__main__":
    d = process_w2v_data(corpus, skip_window=2, method="cbow") #return Dataset(x, y, v2i, i2v)
    m = CBOW(d.num_word, 2)
    train(m, d)

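    # plotting (create the output directory first; plt.savefig fails if it is missing)
    import os
    os.makedirs("./result", exist_ok=True)
    show_w2v_word_embedding(m, d, "./result/cbow.png")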