What is Word Embedding?

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

Word embedding is essentially a more advanced way of vectorizing words: it is a mapping that takes each word to a vector in $\mathbb{R}^n$, a continuous space of relatively low dimension.

Common word-vectorization schemes include the one-hot model and the bag-of-words model, but neither of them has the property described above (a dense, low-dimensional continuous representation).

Moreover, with the bag-of-words model a word's vector usually has a very large number of dimensions, and most of its components are zero; this is especially true of the one-hot model. With word embedding, by contrast, each word can typically be represented by a real-valued vector of only 150-300 dimensions.

Skip-Gram is one way to realize this kind of word vectorization. Starting from a one-hot encoding of the words, it uses a simple classification neural network to learn each word's vector. Skip-Gram also takes a word's context into account: training samples are built from the words that appear near it in the corpus, and the "skip" in Skip-Gram marks the boundary of that context. For example, take the sentence "I live in china and I like Chinese food." If we use "china" as the center word and treat every word within skip=3 positions of it as context, that amounts to saying that we consider the word pairs


<I,china>,<live,china>,<in,china>,<china,and>,<china,I>,<china,like>

to be related in meaning, and that grouping them together is reasonable.
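To make this concrete, here is a tiny sketch that collects the context words of "china" for the example sentence with a fixed window of skip=3 (for illustration only; the pair-generation code we actually use for our corpus comes in step 5):

sentence = "I live in china and I like Chinese food".split()
skip = 3
center = sentence.index("china")
# the context of "china": every word at most `skip` positions away from it
context = [sentence[i]
           for i in range(max(0, center - skip), min(len(sentence), center + skip + 1))
           if i != center]
print(context)   # ['I', 'live', 'in', 'and', 'I', 'like']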

Next, let's look at how to use Skip-Gram to vectorize words.

First, let's define a fairly simple corpus.

 Pumas are large, cat-like animals which are found in America. When reports came into London Zoo that a wild puma had been spotted forty-five miles south of London, they were not taken seriously. However, as the evidence began to accumulate, experts from the Zoo felt obliged to investigate, for the descriptions given by people who claimed to have seen the puma were extraordinarily similar.The hunt for the puma began in a small village where a woman picking blackberries saw ‘a large cat’ only five yards away from her. It immediately ran away when she saw it, and experts confirmed that a puma will not attack a human being unless it is cornered. The search proved difficult, for the puma was often observed at one place in the morning and at another place twenty miles away in the evening. Wherever it went, it left behind it a trail of dead deer and small animals like rabbits. Paw prints were seen in a number of places and puma fur was found clinging to bushes. Several people complained of “cat-like noises’ at night and a businessman on a fishing trip saw the puma up a tree. The experts were now fully convinced that the animal was a puma, but where had it come from? As no pumas had been reported missing from any zoo in the country, this one must have been in the possession of a private collector and somehow managed to escape. The hunt went on for several weeks, but the puma was not caught. It is disturbing to think that a dangerous wild animal is still at large in the quiet countryside.

This passage is taken from New Concept English Book 3.
1. Clean the data: strip the punctuation and extract the words

__author__ = 'jmh081701'
import re

def getWords(data):
    # extract the words: runs of letters (and hyphens); punctuation is discarded
    rule=r"([A-Za-z-]+)"
    pattern=re.compile(rule)
    words=pattern.findall(data)
    return words

Let's test it by printing the first 5 words (data holds the corpus text above):

>>words = getWords(data)
>>print(words[0:5])
['Pumas', 'are', 'large', 'cat-like', 'animals']
2. Find out how many distinct words there are and count their frequencies

def enumWords(words):
    # count the frequency of each distinct word; returns {word: count}
    rst={}
    for each in words:
        if not each in rst:
            rst.setdefault(each,1)
        else:
            rst[each]+=1
    return rst

The enumWords function counts how many distinct words we have, returning a dict whose keys are the words and whose values are their frequencies.

>>words = getWords(data)
>>words = enumWords(words)
>>print(len(words))
154

There are 154 distinct words in total.
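By the way, the same counting could also be done with collections.Counter from the standard library; this is just an equivalent alternative sketch, and the rest of the article keeps using enumWords:

from collections import Counter

def enumWordsCounter(words):
    # equivalent to enumWords: map each distinct word to its frequency
    return dict(Counter(words))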

3. Preliminary data cleaning: build the vocabulary
If needed, we could drop the words that occur very rarely so that the vocabulary does not grow too large.
In our example, however, there are only 154 words in total, so there is no real need to remove low-frequency words (a sketch of how it could be done follows the code below).

>>vocaburary=list(words)
>>print(vocaburary[0:5])
['accumulate', 'which', 'reported', 'spotted', 'think']
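If the corpus were larger, the low-frequency filtering mentioned above could look roughly like this (a minimal sketch; min_count is a hypothetical threshold, not something we use later in this article):

min_count = 2   # hypothetical threshold: keep only words seen at least this often
vocaburary = [word for word, freq in words.items() if freq >= min_count]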

4. Use one-hot encoding to give each word an initial vector
In our corpus the vocabulary has only 154 words, so the one-hot encoding of each word is a 154-dimensional vector with exactly one component equal to 1 and all other components equal to 0:

[0,0,0,…,1,…,0,0,0]

If the i-th component is 1, the word sits at the i-th position of the vocabulary.


For example, the first 5 words of our vocabulary are:

['accumulate', 'which', 'reported', 'spotted', 'think']

So the one-hot encoding of 'accumulate' is [1,0,0,…,0,0,0,0],
and likewise the one-hot encoding of 'which' is [0,1,0,…,0,0,0,0].
Let's write a function that one-hot encodes a given word:

import numpy

def onehot(word,vocaburary):
    l=len(vocaburary)
    vec=numpy.zeros(shape=[1,l],dtype=numpy.float32)   # 1 x n vector of zeros
    index=vocaburary.index(word)                       # position of the word in the vocabulary
    vec[0][index]=1.0
    return vec

The function returns a numpy array of shape 1 x n, i.e. a row vector.
Let's test it:

>>print(onehot(vocaburary[0],vocaburary))
[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  ···  0.]]

5. Define the context boundary skip_size
The context boundary specifies that the skip_size words before a word in the text and the skip_size words after it are all regarded as closely related to it in meaning. skip_size is usually kept small; 2-5 is enough. It can also be chosen at random for each word.

import random

def getContext(words):
    rst=[]
    for index,word in enumerate(words):
        skip_size=random.randint(1,5)   # random context boundary for this word
        # words within skip_size positions before the current word
        for i in range(max(0,index-skip_size),index):
            rst.append([word,words[i]])
        # words within skip_size positions after the current word
        for i in range(index+1,min(len(words),index+skip_size+1)):
            rst.append([word,words[i]])
    return rst

Note that the words argument here is the return value of getWords(), i.e. the word list extracted from the original text.
Let's test it:

>>Context=getContext(getWords(data))
>>print(Context[0:5])
[['Pumas', 'are'], ['are', 'Pumas'], ['are', 'large'], ['are', 'cat-like'], ['large', 'are']]

6. Design the neural network for training

First, let's be clear about the format of the training samples.

The word pairs obtained in step 5 are our training samples: each element of the return value of step 5 is a word pair, and we take the first element of the pair as X, the second element as Y, and one-hot encode both of them to form a training sample.
For example, for the first element of Context above, ['Pumas', 'are'],
we take the one-hot encoding of 'Pumas' as X and the one-hot encoding of 'are' as Y.

>>Context=getContext(getWords(data))
>>X=onehot(Context[0][0],vocaburary)
>>Y=onehot(Context[0][1],vocaburary)
>>print(X)
[[ 0.  ···  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  ···  ]]
>>print(Y)
[[ 0. ....  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.   0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

OK, so both the input and the output are one-hot encoded 154-dimensional vectors (1 x 154).
Next we design a simple multilayer perceptron; our goal is to represent each word with a 30-dimensional vector.

Input layer: 1 x 154
Hidden layer: weight matrix W, shape 154 x 30
              bias b, shape 1 x 30
              activation: none
Output layer: weight matrix W', shape 30 x 154
              bias b', shape 1 x 154
              activation: softmax

Then, once the network has been trained, the hidden layer's weight matrix W is exactly what we want: the i-th row of this matrix is the vector representation of the i-th word in the vocabulary.
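To make the design concrete, here is a minimal numpy sketch of this network, reusing vocaburary, onehot and Context from the previous steps (W2 and b2 stand for W' and b' above). It only illustrates the architecture just described, not the author's actual training code; names such as train_X, train_Y, lr and the 200-epoch loop are assumptions made for the example.

import numpy

vocab_size = len(vocaburary)      # 154 in this example
embed_size = 30                   # dimension of the word vectors we want

# Build the training matrices from the word pairs of step 5
train_X = numpy.vstack([onehot(pair[0], vocaburary) for pair in Context])   # N x 154
train_Y = numpy.vstack([onehot(pair[1], vocaburary) for pair in Context])   # N x 154

# Hidden layer: W (154 x 30), b (1 x 30), no activation
W  = numpy.random.randn(vocab_size, embed_size).astype(numpy.float32) * 0.01
b  = numpy.zeros((1, embed_size), dtype=numpy.float32)
# Output layer: W2 (30 x 154), b2 (1 x 154), softmax activation
W2 = numpy.random.randn(embed_size, vocab_size).astype(numpy.float32) * 0.01
b2 = numpy.zeros((1, vocab_size), dtype=numpy.float32)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = numpy.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for epoch in range(200):
    # Forward pass
    h = train_X.dot(W) + b           # N x 30, the hidden layer has no activation
    p = softmax(h.dot(W2) + b2)      # N x 154, predicted context-word probabilities

    # Backward pass: gradient of the mean cross-entropy loss
    d_logits = (p - train_Y) / len(train_X)
    dW2 = h.T.dot(d_logits)
    db2 = d_logits.sum(axis=0, keepdims=True)
    dh  = d_logits.dot(W2.T)
    dW  = train_X.T.dot(dh)
    db  = dh.sum(axis=0, keepdims=True)

    # Plain gradient-descent update
    W  -= lr * dW
    b  -= lr * db
    W2 -= lr * dW2
    b2 -= lr * db2

# Row i of W is now the 30-dimensional vector of the i-th vocabulary word
print(W[vocaburary.index('puma')])

On a vocabulary of only 154 words the full softmax above is perfectly affordable; on a realistically large corpus one would instead train with mini-batches and tricks such as negative sampling.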