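• Contents

1. Text similarity: problems and applications

2. Overview of text similarity models

3. Hands-on: edit distance in Python

4. Hands-on: near-duplicate text detection with simhash

5. Hands-on: Word AVG with word vectors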


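1. Text similarity: problems and applications

  • The text similarity problem

Text similarity covers comparisons between units of the same granularity (word and word, sentence and sentence, paragraph and paragraph, document and document) as well as across granularities (word and sentence, sentence and paragraph, paragraph and document, and so on). "Similar" here means semantically similar. These problems are listed in increasing order of difficulty.

  • Applications of text similarity

Search systems:

1) Use the query to retrieve the most relevant texts/web pages.

2) Make use of information such as the page title and content.

Question answering systems:

The user's question is matched against the questions in the corpus by similarity, and the answer to the most similar question is returned as the reply.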
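The matching step described above can be sketched in a few lines. This is only an illustration: the FAQ pairs are made up, and word-overlap (Jaccard) similarity stands in for whatever similarity measure a real system would use.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

# Hypothetical corpus of (question, answer) pairs, invented for this sketch.
faq = [
    ("how do i reset my password", "Use the 'forgot password' link on the login page."),
    ("what are your opening hours", "We are open from 9am to 5pm on weekdays."),
]

def answer(question, faq_pairs):
    """Return the answer attached to the corpus question most similar to `question`."""
    best_question, best_answer = max(faq_pairs, key=lambda qa: jaccard(question, qa[0]))
    return best_answer

print(answer("How can I reset my password", faq))
```

Swapping `jaccard` for any of the similarity measures in section 2 leaves the retrieval logic unchanged.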

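Chatbots (retrieval-based models):

Consider a chatbot that answers questions using text similarity.

Each turn looks reasonable on its own. Taken together over multiple turns, however, the conversation feels mechanical and falls short of the expected result.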

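2. Overview of text similarity models

  • Hamming distance

For two strings of equal length, the Hamming distance is the number of positions at which the tokens differ. For example: d(cap, cat) = 1

The smaller the distance, the more similar the two strings; the larger the distance, the more they differ. This method is clearly too simple: some words are close in meaning but are completely different strings, such as difficult and hard. It can still be useful in certain specific settings.

Text similarity is about similarity of word meaning and semantics, not of surface form.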
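A minimal sketch of the Hamming distance follows. The function name `hamming_distance` is chosen here for illustration; it works on strings (character tokens) as well as on lists of words.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(s, t) if x != y)

print(hamming_distance("cap", "cat"))                                  # character tokens
print(hamming_distance("I like cats".split(), "I like dogs".split()))  # word tokens
```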

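  • Edit distance

Given two texts or two sentences, the edit distance is the minimum number of operations needed to turn one into the other. The allowed operations are insertion, deletion, and substitution (the three cases handled by the code in section 3).

Solving edit distance with dynamic programming:

Suppose we compare the words kitten and sitting (here the basic unit of an operation is a character; for sentence or paragraph similarity the basic unit would be a word).

The dynamic programming idea is that the edit distance between two strings equals the edit distance between their prefixes plus the minimum number of operations needed to get from the prefixes to the full strings, and this is applied recursively.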

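The recurrence is spelled out in steps 1) to 3) below.

It amounts to filling in the following edit distance matrix (columns follow kitten, rows follow sitting, each preceded by an entry for the empty string):

         k  i  t  t  e  n
      0  1  2  3  4  5  6
   s  1  1  2  3  4  5  6
   i  2  2  1  2  3  4  5
   t  3  3  2  1  2  3  4
   t  4  4  3  2  1  2  3
   i  5  5  4  3  2  2  3
   n  6  6  5  4  3  3  2
   g  7  7  6  5  4  4  3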

1) kitten and sitting both start from position 0: lev_{a,b}(0,0)=0

2) If i or j is 0, i.e. in row 0 or column 0, lev_{a,b}(i,j)=\max(i,j)

3) When neither i nor j is 0:

lev_{a,b}(i,j)=\min\{lev_{a,b}(i-1,j)+1,\ lev_{a,b}(i,j-1)+1,\ lev_{a,b}(i-1,j-1)\ (\text{if } a_i=b_j),\ lev_{a,b}(i-1,j-1)+1\ (\text{if } a_i\neq b_j)\}

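4) The 3 in the bottom-right corner of the matrix is the edit distance between kitten and sitting (the smaller the distance, the more similar the strings). The 3 diagonally above it is the edit distance between the prefixes kitte and sittin.

5) Once the first row and first column of the matrix are initialised, the edit distance of every pair of prefixes is computed from the cells diagonally above, above, and to the left, using the formula in 3). Take the cell in the fourth row and sixth column of the matrix (counting the header row and column), which holds the edit distance between the prefixes kitt and si, namely 3. First look diagonally up: the edit distance between kit and s is 3, and since a_i \neq b_j (t and i differ), the formula gives 3+1=4. Next look up: the edit distance between kitt and s is 4, and kitt/si is reached with one more insertion, so the formula gives 4+1=5. Finally look left: the edit distance between kit and si is 2, and kitt/si is reached with one more deletion, so the formula gives 2+1=3. The minimum of the three is 3, so the edit distance between kitt and si is 3.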
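The walk-through above can be checked with a short sketch that builds the whole matrix. The function name `lev_matrix` and the orientation (rows follow the second string, columns follow the first) are choices made for this illustration.

```python
def lev_matrix(a, b):
    """Edit-distance matrix d, where d[i][j] is the distance between b[:i] and a[:j]."""
    rows, cols = len(b) + 1, len(a) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j                      # row 0: distance from the empty prefix
    for i in range(rows):
        d[i][0] = i                      # column 0: distance to the empty prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if b[i - 1] == a[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # cell above
                          d[i][j - 1] + 1,         # cell to the left
                          d[i - 1][j - 1] + cost)  # cell diagonally above
    return d

m = lev_matrix("kitten", "sitting")
for row in m:
    print(row)
print("kitt vs si:", m[2][4], "| kitten vs sitting:", m[-1][-1])
```

Indexing `m[2][4]` picks the row for the prefix si and the column for the prefix kitt, the cell discussed in step 5).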

  • Jaccard Similarity

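Given two texts or two sentences, take the intersection and the union of the words that occur in them. The size of the intersection divided by the size of the union is the Jaccard Similarity.

For example:

s1 = "Natural language processing is a promising research area" # 8 words
s2 = "More and more researchers are working on natural language processing nowadays" # 11 words

Intersection: language, processing (2 words; matching is case-sensitive, so Natural and natural count as different words). Jaccard Similarity = 2/(8+11-2) = 2/17 ≈ 0.1176

Drawbacks: it only considers whether a word occurs, ignoring the meaning of each word, the word order, and the number of times each word occurs.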

  • SimHash

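SimHash is widely used in search engines. A keyword search returns a series of relevant web pages, but many pages on the internet are near-identical copies of one another. A high-quality result list should be diverse: we do not want the top ten results to all be the same page. Comparing the simhashes of the pages makes it possible to return distinct content.

1) Choose a hashsize, e.g. 32.

2) Initialise a vector V = [0]*32.

3) Turn a text into features.

You may strip the whitespace from the original text or keep it. In the example, the features are obtained by taking every three consecutive characters of the original text as one feature (a 3-gram).

4) Hash each feature (3-gram) to 32 bits (the hash algorithm itself is not covered here), i.e. a vector of size 32 whose entries are 0 or 1.

5) For every position of every feature's hash, if the bit is 1, add 1 to the corresponding entry V[i]; if the bit is 0, subtract 1 from V[i].

6) Finally, go through every position of V: if V[i]>0, set V[i]=1; otherwise set V[i]=0. The resulting vector V is the simhash of the text.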
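Steps 1) to 6) can be sketched as follows. This is a simplified illustration rather than the library code used in section 4: the names `char_ngrams`, `simhash32` and `hamming_bits` are made up here, and taking the low 32 bits of an MD5 digest as the feature hash is an assumption, since the steps above leave the hash function open.

```python
import hashlib
import re

def char_ngrams(text, width=3):
    """Step 3: lowercase, drop non-word characters, slide a window of `width` characters."""
    text = re.sub(r"[^\w]+", "", text.lower())
    return [text[i:i + width] for i in range(max(len(text) - width + 1, 1))]

def simhash32(text, hashsize=32):
    v = [0] * hashsize                                   # step 2
    for feat in char_ngrams(text):
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)  # step 4 (assumed hash)
        for i in range(hashsize):
            v[i] += 1 if (h >> i) & 1 else -1            # step 5
    # Step 6: positive entries become 1 bits, everything else stays 0.
    return sum(1 << i for i in range(hashsize) if v[i] > 0)

def hamming_bits(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")

a = simhash32("How are you? I am fine. Thanks.")
b = simhash32("How are you? I am fine. Thank you.")
print(hex(a), hex(b), hamming_bits(a, b))
```

Two texts are then treated as near-duplicates when the number of differing bits is below a chosen threshold.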

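  • Similarity based on text features

1) Convert the texts into feature vectors.

One option is bag of words: the vector has one dimension per dictionary word, and each dimension holds the number of times the corresponding word occurs in the text (0 if it does not occur).

Another option is TF-IDF: the vector again has one dimension per dictionary word, and each dimension holds the TF-IDF value of the corresponding word computed for the text (0 if it does not occur).

2) Use the feature vectors to compute the similarity between texts.

Cosine similarity can be computed from the feature vectors A and B of the two texts:

\cos(A,B)=\frac{A\cdot B}{\|A\|\,\|B\|}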
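As a rough sketch of both steps, the snippet below builds bag-of-words counts with `collections.Counter` and computes their cosine similarity in plain Python. Splitting on whitespace and lowercasing are simplifying assumptions made here; a library vectorizer may tokenise differently and give a different value.

```python
import math
from collections import Counter

def bow_vector(text):
    """Bag-of-words counts; words missing from the text implicitly count as 0."""
    return Counter(text.lower().split())

def cosine_similarity_bow(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

s1 = "Natural language processing is a promising research area"
s2 = "More and more researchers are working on natural language processing nowadays"
print(cosine_similarity_bow(bow_vector(s1), bow_vector(s2)))
```

Unlike Jaccard Similarity, the counts matter here: the repeated word more contributes 2 to the length of the second vector.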

  • word2Vec

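Word vectors can be used to measure the similarity between words: words with the same meaning should have similar vectors. After reducing the dimensionality of the word vectors and visualising them, a clustering effect appears in which words with similar meanings group together.

For text or sentence similarity, one option is to take the plain average of the word vectors of all words in a sentence, use the result as the vector representation of the whole sentence, and compare sentences by cosine similarity. Another option is a weighted average of the word vectors, where the weight can be the TF-IDF value of each word.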
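The averaging idea can be sketched with a handful of toy vectors. The 3-dimensional vectors below are invented purely for illustration and do not come from a trained word2vec model; the optional `weights` argument is a hypothetical hook for TF-IDF weights.

```python
import math

# Toy "word vectors", made up for this sketch.
toy_vectors = {
    "cat":  [0.9, 0.1, 0.0],
    "dog":  [0.8, 0.2, 0.0],
    "car":  [0.0, 0.1, 0.9],
    "runs": [0.3, 0.7, 0.1],
}

def sentence_vector(tokens, vectors, weights=None):
    """(Weighted) average of the vectors of the known tokens; unknown tokens are skipped."""
    dim = len(next(iter(vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for tok in tokens:
        if tok not in vectors:
            continue
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        for k in range(dim):
            total[k] += w * vectors[tok][k]
        weight_sum += w
    return total if weight_sum == 0 else [x / weight_sum for x in total]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu, nv = math.sqrt(sum(x * x for x in u)), math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

cat_runs = sentence_vector(["cat", "runs"], toy_vectors)
dog_runs = sentence_vector(["dog", "runs"], toy_vectors)
car_runs = sentence_vector(["car", "runs"], toy_vectors)
print(cosine(cat_runs, dog_runs), cosine(cat_runs, car_runs))
```

With real embeddings the same two functions apply unchanged; only the `vectors` dictionary is replaced by the trained model's lookup.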

3. Hands-on: edit distance in Python

 

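def editDistDP(s1, s2):
    m = len(s1)
    n = len(s2)
    # Table that stores the answers to all subproblems
    dp = [[0 for _ in range(n+1)] for _ in range(m+1)]

    # Fill the DP table from top to bottom
    for i in range(m+1):
        for j in range(n+1):

            # If the first sequence is empty, the only option is to insert every token of the second one
            if i == 0:
                dp[i][j] = j    # Min. operations = j

            # If the second sequence is empty, the only option is to delete every token of the first one
            elif j == 0:
                dp[i][j] = i    # Min. operations = i

            # If the last tokens are equal, ignore them and reuse the result for the prefixes
            elif s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]

            # If the last tokens differ, consider all three operations and take the cheapest
            else:
                dp[i][j] = 1 + min(dp[i][j-1],        # insert
                                   dp[i-1][j],        # delete
                                   dp[i-1][j-1])      # substitute

    return dp[m][n]

s1 = "natural language processing is a promising research area"
s2 = "more researchers are working on natural language processing nowadays"
# The inputs are two sentences compared word by word; .split() splits on whitespace.
# Chinese text has to be segmented first, e.g. with jieba.
print(editDistDP(s1.split(), s2.split()))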

ww2 = """
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 50 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
Japan, which aimed to dominate Asia and the Pacific, was at war with China by 1937,[5][b] though neither side had declared war on the other. World War II is generally said to have begun on 1 September 1939,[6] with the invasion of Poland by Germany and subsequent declarations on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. Following the onset of campaigns in North Africa and East Africa, and the fall of France in mid 1940, the war continued primarily between the European Axis powers and the British Empire. War in the Balkans, the aerial Battle of Britain, the Blitz, and the long Battle of the Atlantic followed. On 22 June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history. This Eastern Front trapped the Axis, most crucially the German Wehrmacht, into a war of attrition. In December 1941, Japan launched a surprise attack on the United States and European colonies in the Pacific. Following an immediate U.S. declaration of war against Japan, supported by one from Great Britain, the European Axis powers quickly declared war on the U.S. in solidarity with their Japanese ally. Rapid Japanese conquests over much of the Western Pacific ensued, perceived by many in Asia as liberation from Western dominance and resulting in the support of several armies from defeated territories.
The Axis advance in the Pacific halted in 1942 when Japan lost the critical Battle of Midway; later, Germany and Italy were defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. Key setbacks in 1943, which included a series of German defeats on the Eastern Front, the Allied invasions of Sicily and Italy, and Allied victories in the Pacific, cost the Axis its initiative and forced it into strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained its territorial losses and turned toward Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in Central China, South China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.
The war in Europe concluded with an invasion of Germany by the Western Allies and the Soviet Union, culminating in the capture of Berlin by Soviet troops, the suicide of Adolf Hitler and the German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, the Soviet entry into the war against Japan and its invasion of Manchuria, Japan announced its intention to surrender on 15 August 1945, cementing total victory in Asia for the Allies. Tribunals were set up by fiat by the Allies and war crimes trials were conducted in the wake of the war both against the Germans and the Japanese.
World War II changed the political alignment and social structure of the globe. The United Nations (UN) was established to foster international co-operation and prevent future conflicts; the victorious great powers—China, France, the Soviet Union, the United Kingdom, and the United States—became the permanent members of its Security Council.[7] The Soviet Union and United States emerged as rival superpowers, setting the stage for the nearly half-century long Cold War. In the wake of European devastation, the influence of its great powers waned, triggering the decolonisation of Africa and Asia. Most countries whose industries had been damaged moved towards economic recovery and expansion. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and create a common identity.[8]"""

ww1 = """World War I (often abbreviated as WWI or WW1), also known as the First World War or the Great War, was a global war originating in Europe that lasted from 28 July 1914 to 11 November 1918. Contemporaneously described as "the war to end all wars",[7] it led to the mobilisation of more than 70 million military personnel, including 60 million Europeans, making it one of the largest wars in history.[8][9] It is also one of the deadliest conflicts in history,[10] with an estimated nine million combatants and seven million civilian deaths as a direct result of the war, while resulting genocides and the 1918 influenza pandemic caused another 50 to 100 million deaths worldwide.[11]
On 28 June 1914, Gavrilo Princip, a Bosnian Serb Yugoslav nationalist, assassinated the Austro-Hungarian heir Archduke Franz Ferdinand in Sarajevo, leading to the July Crisis.[12][13] In response, on 23 July Austria-Hungary issued an ultimatum to Serbia. Serbia's reply failed to satisfy the Austrians, and the two moved to a war footing.
A network of interlocking alliances enlarged the crisis from a bilateral issue in the Balkans to one involving most of Europe. By July 1914, the great powers of Europe were divided into two coalitions: the Triple Entente—consisting of France, Russia and Britain—and the Triple Alliance of Germany, Austria-Hungary and Italy (the Triple Alliance was primarily defensive in nature, allowing Italy to stay out of the war in 1914).[14] Russia felt it necessary to back Serbia and, after Austria-Hungary shelled the Serbian capital of Belgrade on the 28th, partial mobilisation was approved.[15] General Russian mobilisation was announced on the evening of 30 July; on the 31st, Austria-Hungary and Germany did the same, while Germany demanded Russia demobilise within 12 hours.[16] When Russia failed to comply, Germany declared war on 1 August in support of Austria-Hungary, with Austria-Hungary following suit on 6th; France ordered full mobilisation in support of Russia on 2 August.[17]
German strategy for a war on two fronts against France and Russia was to rapidly concentrate the bulk of its army in the West to defeat France within four weeks, then shift forces to the East before Russia could fully mobilise; this was later known as the Schlieffen Plan.[18] On 2 August, Germany demanded free passage through Belgium, an essential element in achieving a quick victory over France.[19] When this was refused, German forces invaded Belgium on 3 August and declared war on France the same day; the Belgian government invoked the 1839 Treaty of London and in compliance with its obligations under this, Britain declared war on Germany on 4 August.[20][21] On 12 August, Britain and France also declared war on Austria-Hungary; on the 23rd, Japan sided with the Entente, seizing German possessions in China and the Pacific. In November 1914, the Ottoman Empire entered the war on the side of the Alliance, opening fronts in the Caucasus, Mesopotamia and the Sinai Peninsula. The war was fought in and drew upon each powers' colonial empires as well, spreading the conflict to Africa and across the globe. The Entente and its allies would eventually become known as the Allied Powers, while the grouping of Austria-Hungary, Germany and their allies would become known as the Central Powers.
The German advance into France was halted at the Battle of the Marne and by the end of 1914, the Western Front settled into a battle of attrition, marked by a long series of trench lines that changed little until 1917 (the Eastern Front, by contrast, was marked by much greater exchanges of territory). In 1915, Italy joined the Allied Powers and opened a front in the Alps. The Kingdom of Bulgaria joined the Central Powers in 1915 and the Kingdom of Greece joined the Allies in 1917, expanding the war in the Balkans. The United States initially remained neutral, although by doing nothing to prevent the Allies from procuring American supplies whilst the Allied blockade effectively prevented the Germans from doing the same the U.S. became an important supplier of war material to the Allies. Eventually, after the sinking of American merchant ships by German submarines, and the revelation that the Germans were trying to incite Mexico to make war on the United States, the U.S. declared war on Germany on 6 April 1917. Trained American forces would not begin arriving at the front in large numbers until mid-1918, but ultimately the American Expeditionary Force would reach some two million troops.[22]
Though Serbia was defeated in 1915, and Romania joined the Allied Powers in 1916 only to be defeated in 1917, none of the great powers were knocked out of the war until 1918. The 1917 February Revolution in Russia replaced the Tsarist autocracy with the Provisional Government, but continuing discontent at the cost of the war led to the October Revolution, the creation of the Soviet Socialist Republic, and the signing of the Treaty of Brest-Litovsk by the new government in March 1918, ending Russia's involvement in the war. This allowed the transfer of large numbers of German troops from the East to the Western Front, resulting in the German March 1918 Offensive. This offensive was initially successful, but the Allies rallied and drove the Germans back in their Hundred Days Offensive.[23] Bulgaria was the first Central Power to sign an armistice—the Armistice of Salonica on 29 September 1918. On 30 October, the Ottoman Empire capitulated, signing the Armistice of Mudros.[24] On 4 November, the Austro-Hungarian empire agreed to the Armistice of Villa Giusti. With its allies defeated, revolution at home, and the military no longer willing to fight, Kaiser Wilhelm abdicated on 9 November and Germany signed an armistice on 11 November 1918.
World War I was a significant turning point in the political, cultural, economic, and social climate of the world. The war and its immediate aftermath sparked numerous revolutions and uprisings. The Big Four (Britain, France, the United States, and Italy) imposed their terms on the defeated powers in a series of treaties agreed at the 1919 Paris Peace Conference, the most well known being the German peace treaty—the Treaty of Versailles.[25] Ultimately, as a result of the war the Austro-Hungarian, German, Ottoman, and Russian Empires ceased to exist, with numerous new states created from their remains. However, despite the conclusive Allied victory (and the creation of the League of Nations during the Peace Conference, intended to prevent future wars), a Second World War would follow just over twenty years later."""

netease = """NetEase, Inc. (simplified Chinese: 网易; traditional Chinese: 網易; pinyin: WǎngYì) is a Chinese Internet technology company providing online services centered on content, community, communications and commerce. The company was founded in 1997 by Lebunto. NetEase develops and operates online PC and mobile games, advertising services, email services and e-commerce platforms in China. It is one of the largest Internet and video game companies in the world.[7]
Some of NetEase's games include the Westward Journey series (Fantasy Westward Journey, Westward Journey Online II, Fantasy Westward Journey II, and New Westward Journey Online II), as well as other games, such as Tianxia III, Heroes of Tang Dynasty Zero and Ghost II. NetEase also partners with Blizzard Entertainment to operate local versions of Warcraft III, World of Warcraft, Hearthstone, StarCraft II, Diablo III: Reaper of Souls and Overwatch in China. They are also developing their very first self-developed VR multiplayer online game with an open world setting, which is called Nostos.[8]"""
print(editDistDP(ww1.split(), ww2.split())) # expected to be the more similar pair: the smaller the distance, the more similar
print(editDistDP(ww1.split(), netease.split()))

 

4. Hands-on: near-duplicate text detection with simhash

  • Jaccard Similarity
def jaccard_sim(s1, s2):
    a = set(s1.split())  # tokenise and convert to a set to remove duplicates
    print(len(a))
    b = set(s2.split())
    print(len(b))
    c = a.intersection(b)  # intersection
    print(len(c))
    print(c)
    return float(len(c)) / (len(a) + len(b) - len(c))

s1 = "Natural language processing is a promising research area"
s2 = "More and more researchers are working on natural language processing nowadays"
print(jaccard_sim(s1, s2))

  • SimHash

The principle is described in section 2 above.

# Created by 1e0n in 2013
from __future__ import division, unicode_literals

import re
import sys
import hashlib
import logging
import numbers
import collections
from itertools import groupby

if sys.version_info[0] >= 3:
    basestring = str
    unicode = str
    long = int
else:
    range = xrange


def _hashfunc(x): # default hash function
    return int(hashlib.md5(x).hexdigest(), 16)


class Simhash(object):

    def __init__(
        self, value, f=64, reg=r'[\w\u4e00-\u9fcc]+', hashfunc=None, log=None
    ):
        """
        `f` is the dimensions of fingerprints

        `reg` is meaningful only when `value` is basestring and describes
        what is considered to be a letter inside parsed string. Regexp
        object can also be specified (some attempt to handle any letters
        is to specify reg=re.compile(r'\w', re.UNICODE))

        `hashfunc` accepts a utf-8 encoded string and returns a unsigned
        integer in at least `f` bits.
        """

        self.f = f
        self.reg = reg
        self.value = None

        if hashfunc is None:
            self.hashfunc = _hashfunc
        else:
            self.hashfunc = hashfunc

        if log is None:
            self.log = logging.getLogger("simhash")
        else:
            self.log = log


        if isinstance(value, Simhash):
            self.value = value.value
        elif isinstance(value, basestring):
            self.build_by_text(unicode(value))
        elif hasattr(value, '__iter__'):  # collections.Iterable was removed in Python 3.10
            self.build_by_features(value)
        elif isinstance(value, numbers.Integral):
            self.value = value
        else:
            raise Exception('Bad parameter with type {}'.format(type(value)))

    def __eq__(self, other):
        """
        Compare two simhashes by their value.

        :param Simhash other: The Simhash object to compare to
        """
        return self.value == other.value

    def _slide(self, content, width=4):
        return [content[i:i + width] for i in range(max(len(content) - width + 1, 1))]

    def _tokenize(self, content):
        content = content.lower()
        content = ''.join(re.findall(self.reg, content))
        ans = self._slide(content)
        return ans

    def build_by_text(self, content):
        features = self._tokenize(content)
        features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))}
        return self.build_by_features(features)

    def build_by_features(self, features):
        """
        `features` might be a list of unweighted tokens (a weight of 1
                   will be assumed), a list of (token, weight) tuples or
                   a token -> weight dict.
        """
        v = [0] * self.f  # initialise to [0, 0, 0, ...]
        masks = [1 << i for i in range(self.f)]  # bit masks, in binary: [1, 10, 100, 1000, 10000, ...]
        if isinstance(features, dict):
            features = features.items()
        for f in features:
            if isinstance(f, basestring):
                h = self.hashfunc(f.encode('utf-8'))  # hash the feature (MD5 yields 128 bits; only the low self.f bits are used)
                w = 1
            else:
                assert hasattr(f, '__iter__')  # (token, weight) pair; collections.Iterable was removed in Python 3.10
                h = self.hashfunc(f[0].encode('utf-8'))
                w = f[1]
            for i in range(self.f):
                v[i] += w if h & masks[i] else -w  # add w if bit i is 1, otherwise subtract w
        ans = 0
        for i in range(self.f):  # build the fingerprint
            if v[i] > 0:  # set bit i to 1 when v[i] > 0; bits with v[i] <= 0 simply stay 0
                ans |= masks[i]
        self.value = ans

    def distance(self, another):
        assert self.f == another.f
        x = (self.value ^ another.value) & ((1 << self.f) - 1)  # XOR: equal bits give 0, differing bits give 1
        ans = 0
        while x:
            ans += 1  # count the 1 bits in the binary representation
            x &= x - 1
        return ans  # the distance (number of differing bits)


def get_features(s):  # generate features
    width = 3  # 3-grams
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)  # remove every non-word character
    return [s[i:i+width] for i in range(max(len(s) - width + 1, 1))]

print(get_features("How are you? I am fine. Thanks. "))
print(Simhash(get_features("How are you? I am fine. Thanks. ")))
print(Simhash(get_features("How are you? I am fine. Thanks. ")).value)
print(hex(Simhash(get_features("How are you? I am fine. Thanks. ")).value))

 

print(Simhash('aa').distance(Simhash('bb')))
print(Simhash('aa').distance(Simhash('aa')))
print(Simhash(get_features("How are you? I am fine. Thanks. ")).distance(Simhash(get_features("How are you? I am fine. Thanks. "))))
print(Simhash(get_features("How are you? I am fine. Thank you. ")).distance(Simhash(get_features("How are you? I am fine. Thanks. "))))

  • SimhashIndex
class SimhashIndex(object):

    def __init__(self, objs, f=64, k=2, log=None):
        """
        `objs` is a list of (obj_id, simhash)
        obj_id is a string, simhash is an instance of Simhash
        `f` is the same with the one for Simhash
        `k` is the tolerance
        """
        self.k = k
        self.f = f
        count = len(objs)

        if log is None:
            self.log = logging.getLogger("simhash")
        else:
            self.log = log

        self.log.info('Initializing %s data.', count)

        self.bucket = collections.defaultdict(set)

        for i, q in enumerate(objs):
            if i % 10000 == 0 or i == count - 1:
                self.log.info('%s/%s', i + 1, count)

            self.add(*q)

    def get_near_dups(self, simhash):
        """
        `simhash` is an instance of Simhash
        return a list of obj_id, which is in type of str
        """
        assert simhash.f == self.f

        ans = set()

        for key in self.get_keys(simhash):
            dups = self.bucket[key]
            self.log.debug('key:%s', key)
            if len(dups) > 200:
                self.log.warning('Big bucket found. key:%s, len:%s', key, len(dups))

            for dup in dups:
                sim2, obj_id = dup.split(',', 1)
                sim2 = Simhash(long(sim2, 16), self.f)

                d = simhash.distance(sim2)
                if d <= self.k:
                    ans.add(obj_id)
        return list(ans)

    def add(self, obj_id, simhash):
        """
        `obj_id` is a string
        `simhash` is an instance of Simhash
        """
        assert simhash.f == self.f

        for key in self.get_keys(simhash):
            v = '%x,%s' % (simhash.value, obj_id)
            self.bucket[key].add(v)

    def delete(self, obj_id, simhash):
        """
        `obj_id` is a string
        `simhash` is an instance of Simhash
        """
        assert simhash.f == self.f

        for key in self.get_keys(simhash):
            v = '%x,%s' % (simhash.value, obj_id)
            if v in self.bucket[key]:
                self.bucket[key].remove(v)

    @property
    def offsets(self):
        """
        You may optimize this method according to <http://www.wwwconference.org/www2007/papers/paper215.pdf>
        """
        return [self.f // (self.k + 1) * i for i in range(self.k + 1)]

    def get_keys(self, simhash):
        for i, offset in enumerate(self.offsets):
            if i == (len(self.offsets) - 1):
                m = 2 ** (self.f - offset) - 1
            else:
                m = 2 ** (self.offsets[i + 1] - offset) - 1
            c = simhash.value >> offset & m
            yield '%x:%x' % (c, i)

    def bucket_size(self):
        return len(self.bucket)
data = {
    1: u'How are you? I am fine. blar blar blar blar blar Thanks.', 
    2: u'How are you i am fine. blar blar blar blar blar Thanks.', 
    3: u'This is a simhash test', 
}

objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
index = SimhashIndex(objs, k=3)

print(index.bucket_size())
s1 = Simhash(get_features(u'This is a simhash test'))
print(index.get_near_dups(s1)) # ids of the indexed texts that are near-duplicates of s1

index.add('4', s1)
print(index.get_near_dups(s1))

 

5. Hands-on: Word AVG with word vectors

  • bag of words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bow_cosine(s1, s2):
    vectorizer = CountVectorizer()
    vectorizer.fit([s1, s2])
    X = vectorizer.transform([s1, s2]) # bag-of-words vectors for s1 and s2
    print(X.toarray())

    print(cosine_similarity(X[0], X[1]))
    
s1 = "Natural language processing is a promising research area "
s2 = "More and more researchers are working on natural language processing nowadays"
bow_cosine(s1, s2)

ww2 = """
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945. The vast majority of the world's countries—including all the great powers—eventually formed two opposing military alliances: the Allies and the Axis. A state of total war emerged, directly involving more than 100 million people from over 30 countries. The major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 50 to 85 million fatalities, most of whom were civilians in the Soviet Union and China. It included massacres, the genocide of the Holocaust, strategic bombing, premeditated death from starvation and disease, and the only use of nuclear weapons in war.[1][2][3][4]
Japan, which aimed to dominate Asia and the Pacific, was at war with China by 1937,[5][b] though neither side had declared war on the other. World War II is generally said to have begun on 1 September 1939,[6] with the invasion of Poland by Germany and subsequent declarations on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. Following the onset of campaigns in North Africa and East Africa, and the fall of France in mid 1940, the war continued primarily between the European Axis powers and the British Empire. War in the Balkans, the aerial Battle of Britain, the Blitz, and the long Battle of the Atlantic followed. On 22 June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history. This Eastern Front trapped the Axis, most crucially the German Wehrmacht, into a war of attrition. In December 1941, Japan launched a surprise attack on the United States and European colonies in the Pacific. Following an immediate U.S. declaration of war against Japan, supported by one from Great Britain, the European Axis powers quickly declared war on the U.S. in solidarity with their Japanese ally. Rapid Japanese conquests over much of the Western Pacific ensued, perceived by many in Asia as liberation from Western dominance and resulting in the support of several armies from defeated territories.
The Axis advance in the Pacific halted in 1942 when Japan lost the critical Battle of Midway; later, Germany and Italy were defeated in North Africa and then, decisively, at Stalingrad in the Soviet Union. Key setbacks in 1943, which included a series of German defeats on the Eastern Front, the Allied invasions of Sicily and Italy, and Allied victories in the Pacific, cost the Axis its initiative and forced it into strategic retreat on all fronts. In 1944, the Western Allies invaded German-occupied France, while the Soviet Union regained its territorial losses and turned toward Germany and its allies. During 1944 and 1945 the Japanese suffered major reverses in mainland Asia in Central China, South China and Burma, while the Allies crippled the Japanese Navy and captured key Western Pacific islands.
The war in Europe concluded with an invasion of Germany by the Western Allies and the Soviet Union, culminating in the capture of Berlin by Soviet troops, the suicide of Adolf Hitler and the German unconditional surrender on 8 May 1945. Following the Potsdam Declaration by the Allies on 26 July 1945 and the refusal of Japan to surrender under its terms, the United States dropped atomic bombs on the Japanese cities of Hiroshima and Nagasaki on 6 and 9 August respectively. With an invasion of the Japanese archipelago imminent, the possibility of additional atomic bombings, the Soviet entry into the war against Japan and its invasion of Manchuria, Japan announced its intention to surrender on 15 August 1945, cementing total victory in Asia for the Allies. Tribunals were set up by fiat by the Allies and war crimes trials were conducted in the wake of the war both against the Germans and the Japanese.
World War II changed the political alignment and social structure of the globe. The United Nations (UN) was established to foster international co-operation and prevent future conflicts; the victorious great powers—China, France, the Soviet Union, the United Kingdom, and the United States—became the permanent members of its Security Council.[7] The Soviet Union and United States emerged as rival superpowers, setting the stage for the nearly half-century long Cold War. In the wake of European devastation, the influence of its great powers waned, triggering the decolonisation of Africa and Asia. Most countries whose industries had been damaged moved towards economic recovery and expansion. Political integration, especially in Europe, emerged as an effort to end pre-war enmities and create a common identity.[8]"""

ww1 = """World War I (often abbreviated as WWI or WW1), also known as the First World War or the Great War, was a global war originating in Europe that lasted from 28 July 1914 to 11 November 1918. Contemporaneously described as "the war to end all wars",[7] it led to the mobilisation of more than 70 million military personnel, including 60 million Europeans, making it one of the largest wars in history.[8][9] It is also one of the deadliest conflicts in history,[10] with an estimated nine million combatants and seven million civilian deaths as a direct result of the war, while resulting genocides and the 1918 influenza pandemic caused another 50 to 100 million deaths worldwide.[11]
On 28 June 1914, Gavrilo Princip, a Bosnian Serb Yugoslav nationalist, assassinated the Austro-Hungarian heir Archduke Franz Ferdinand in Sarajevo, leading to the July Crisis.[12][13] In response, on 23 July Austria-Hungary issued an ultimatum to Serbia. Serbia's reply failed to satisfy the Austrians, and the two moved to a war footing.
A network of interlocking alliances enlarged the crisis from a bilateral issue in the Balkans to one involving most of Europe. By July 1914, the great powers of Europe were divided into two coalitions: the Triple Entente—consisting of France, Russia and Britain—and the Triple Alliance of Germany, Austria-Hungary and Italy (the Triple Alliance was primarily defensive in nature, allowing Italy to stay out of the war in 1914).[14] Russia felt it necessary to back Serbia and, after Austria-Hungary shelled the Serbian capital of Belgrade on the 28th, partial mobilisation was approved.[15] General Russian mobilisation was announced on the evening of 30 July; on the 31st, Austria-Hungary and Germany did the same, while Germany demanded Russia demobilise within 12 hours.[16] When Russia failed to comply, Germany declared war on 1 August in support of Austria-Hungary, with Austria-Hungary following suit on 6th; France ordered full mobilisation in support of Russia on 2 August.[17]
German strategy for a war on two fronts against France and Russia was to rapidly concentrate the bulk of its army in the West to defeat France within four weeks, then shift forces to the East before Russia could fully mobilise; this was later known as the Schlieffen Plan.[18] On 2 August, Germany demanded free passage through Belgium, an essential element in achieving a quick victory over France.[19] When this was refused, German forces invaded Belgium on 3 August and declared war on France the same day; the Belgian government invoked the 1839 Treaty of London and in compliance with its obligations under this, Britain declared war on Germany on 4 August.[20][21] On 12 August, Britain and France also declared war on Austria-Hungary; on the 23rd, Japan sided with the Entente, seizing German possessions in China and the Pacific. In November 1914, the Ottoman Empire entered the war on the side of the Alliance, opening fronts in the Caucasus, Mesopotamia and the Sinai Peninsula. The war was fought in and drew upon each powers' colonial empires as well, spreading the conflict to Africa and across the globe. The Entente and its allies would eventually become known as the Allied Powers, while the grouping of Austria-Hungary, Germany and their allies would become known as the Central Powers.
The German advance into France was halted at the Battle of the Marne and by the end of 1914, the Western Front settled into a battle of attrition, marked by a long series of trench lines that changed little until 1917 (the Eastern Front, by contrast, was marked by much greater exchanges of territory). In 1915, Italy joined the Allied Powers and opened a front in the Alps. The Kingdom of Bulgaria joined the Central Powers in 1915 and the Kingdom of Greece joined the Allies in 1917, expanding the war in the Balkans. The United States initially remained neutral, although by doing nothing to prevent the Allies from procuring American supplies whilst the Allied blockade effectively prevented the Germans from doing the same the U.S. became an important supplier of war material to the Allies. Eventually, after the sinking of American merchant ships by German submarines, and the revelation that the Germans were trying to incite Mexico to make war on the United States, the U.S. declared war on Germany on 6 April 1917. Trained American forces would not begin arriving at the front in large numbers until mid-1918, but ultimately the American Expeditionary Force would reach some two million troops.[22]
Though Serbia was defeated in 1915, and Romania joined the Allied Powers in 1916 only to be defeated in 1917, none of the great powers were knocked out of the war until 1918. The 1917 February Revolution in Russia replaced the Tsarist autocracy with the Provisional Government, but continuing discontent at the cost of the war led to the October Revolution, the creation of the Soviet Socialist Republic, and the signing of the Treaty of Brest-Litovsk by the new government in March 1918, ending Russia's involvement in the war. This allowed the transfer of large numbers of German troops from the East to the Western Front, resulting in the German March 1918 Offensive. This offensive was initially successful, but the Allies rallied and drove the Germans back in their Hundred Days Offensive.[23] Bulgaria was the first Central Power to sign an armistice—the Armistice of Salonica on 29 September 1918. On 30 October, the Ottoman Empire capitulated, signing the Armistice of Mudros.[24] On 4 November, the Austro-Hungarian empire agreed to the Armistice of Villa Giusti. With its allies defeated, revolution at home, and the military no longer willing to fight, Kaiser Wilhelm abdicated on 9 November and Germany signed an armistice on 11 November 1918.
World War I was a significant turning point in the political, cultural, economic, and social climate of the world. The war and its immediate aftermath sparked numerous revolutions and uprisings. The Big Four (Britain, France, the United States, and Italy) imposed their terms on the defeated powers in a series of treaties agreed at the 1919 Paris Peace Conference, the most well known being the German peace treaty—the Treaty of Versailles.[25] Ultimately, as a result of the war the Austro-Hungarian, German, Ottoman, and Russian Empires ceased to exist, with numerous new states created from their remains. However, despite the conclusive Allied victory (and the creation of the League of Nations during the Peace Conference, intended to prevent future wars), a Second World War would follow just over twenty years later."""

netease = """NetEase, Inc. (simplified Chinese: 网易; traditional Chinese: 網易; pinyin: WǎngYì) is a Chinese Internet technology company providing online services centered on content, community, communications and commerce. The company was founded in 1997 by Lebunto. NetEase develops and operates online PC and mobile games, advertising services, email services and e-commerce platforms in China. It is one of the largest Internet and video game companies in the world.[7]
Some of NetEase's games include the Westward Journey series (Fantasy Westward Journey, Westward Journey Online II, Fantasy Westward Journey II, and New Westward Journey Online II), as well as other games, such as Tianxia III, Heroes of Tang Dynasty Zero and Ghost II. NetEase also partners with Blizzard Entertainment to operate local versions of Warcraft III, World of Warcraft, Hearthstone, StarCraft II, Diablo III: Reaper of Souls and Overwatch in China. They are also developing their very first self-developed VR multiplayer online game with an open world setting, which is called Nostos.[8]"""

bow_cosine(ww1, ww2)
bow_cosine(ww1, netease)

 

  • TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(s1, s2):
    vectorizer = TfidfVectorizer()
    # Learn the vocabulary and IDF weights from both texts,
    # then return the TF-IDF vectors of s1 and s2 as rows of X
    X = vectorizer.fit_transform([s1, s2])
    print(X.toarray())
    print(cosine_similarity(X[0], X[1]))

tfidf_cosine(s1, s2)

 

tfidf_cosine(ww1, ww2)
tfidf_cosine(ww1, netease)
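
To make the weighting behind `TfidfVectorizer` more concrete, here is a minimal pure-Python sketch. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) and, as a simplification, plain whitespace tokenisation, so its numbers will not match `TfidfVectorizer` exactly on real text. The helper names `tfidf_vectors` and `sparse_cosine` are illustrative and not part of any library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Weight = term count * smoothed IDF
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Guard against an empty document, whose norm would be 0
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def sparse_cosine(u, v):
    """For L2-normalised sparse vectors, cosine similarity is the dot product."""
    return sum(w * v[term] for term, w in u.items() if term in v)

vec_a, vec_b = tfidf_vectors(["the cat sat", "the dog sat down"])
print(sparse_cosine(vec_a, vec_b))
```

Terms shared by both documents ("the", "sat") get the lowest IDF, while terms unique to one document ("cat", "dog", "down") get a higher weight but contribute nothing to the dot product, which is why the similarity stays well below 1.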

 

  • Word2Vec
import gensim
import gensim.downloader as api
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

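model = api.load("glove-twitter-25")  # pretrained GloVe word vectors trained on Twitter data, 25 dimensions
# 300-dimensional vectors are the usual choice; 25 dimensions keep this demo small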

print(model.get_vector("dog"))
print(model.get_vector("dog").shape)
print(model.most_similar("cat")) 

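def wordavg(model, words):
    # Represent a sentence as the plain average of its word vectors.
    # Words missing from the model's vocabulary are skipped to avoid a KeyError.
    vectors = [model.get_vector(word) for word in words if word in model]
    return np.mean(vectors, axis=0)

s1 = "Natural language processing is a promising research area"
s2 = "More and more researchers are working on natural language processing nowadays"
v1 = wordavg(model, s1.lower().split())  # Chinese text would need word segmentation first
v2 = wordavg(model, s2.lower().split())
# cosine_similarity expects 2-D arrays, so reshape each vector into a single row
print(cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1)))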
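
One limitation of `wordavg` can be checked without downloading any model: an average does not depend on word order, so two sentences made of the same words get the same vector. The sketch below uses hypothetical 3-dimensional toy vectors that are made up for illustration; `word_avg` and `cosine` are small pure-Python stand-ins, not gensim or scikit-learn functions.

```python
import math

# Hypothetical toy embeddings, invented for this illustration only
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.1],
    "man": [0.1, 0.2, 0.9],
}

def word_avg(vectors, words):
    """Average the word vectors component by component."""
    rows = [vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = word_avg(toy_vectors, "dog bites man".split())
b = word_avg(toy_vectors, "man bites dog".split())
print(cosine(a, b))  # 1.0 up to floating-point rounding
```

The two sentences mean opposite things, yet their averaged vectors coincide, which is one reason a plain average is a weak sentence representation.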

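 Because plain averaging is such a simple scheme, the sentence vectors it produces are not very good representations. That said, the absolute cosine similarity value should not be judged on its own. A better approach is to prepare many text pairs, have human annotators score the similarity of each pair, and rank the pairs by that score; then compute the similarity of each pair with the method above and rank the pairs again. Comparing the two rankings (for example with Spearman's rank correlation coefficient) gives the final assessment of the method.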
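
The ranking comparison described above can be sketched as follows. The human scores and model similarities are hypothetical numbers made up for illustration, and `average_ranks` and `spearman` are hand-written helpers, not library functions. In practice `scipy.stats.spearmanr` computes the same coefficient.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank sequences."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: one entry per text pair
human_scores = [4.8, 3.9, 2.5, 1.2, 0.3]         # annotator similarity ratings
model_scores = [0.93, 0.95, 0.71, 0.80, 0.42]    # cosine similarities from a model

rho = spearman(human_scores, model_scores)
print(round(rho, 2))  # 0.8
```

A coefficient close to 1 means the model orders the pairs almost the same way the annotators do, even if its absolute similarity values are on a different scale.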

 
