Python | Word Embedding using Word2Vec
Word embedding is a language modeling technique used to map words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated using various methods, such as neural networks, co-occurrence matrices, and probabilistic models.
Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec uses two architectures:
- CBOW (Continuous Bag of Words): The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer contains the number of dimensions in which we want to represent the current word present at the output layer.
- Skip Gram: Skip gram predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer contains the number of dimensions in which we want to represent the current word present at the input layer. The two objectives are contrasted in the sketch below.
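The difference between the two objectives can be made concrete with a toy sketch. The sentence and window size below are illustrative assumptions, not gensim internals:

# Toy sketch: training pairs produced by CBOW vs. Skip Gram
sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW: predict the target word from its surrounding context
    print("CBOW      :", context, "->", target)
    # Skip Gram: predict each context word from the target word
    for c in context:
        print("Skip Gram :", target, "->", c)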
The basic idea behind word embeddings is that words appearing in similar contexts tend to lie closer together in the vector space. To generate word vectors in Python, the required modules are nltk and gensim.
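"Closer" here is usually measured with cosine similarity, which the models below also report. A minimal NumPy sketch of the measure, using made-up 3-dimensional vectors (real Word2Vec vectors typically have 100 or more dimensions):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|), ranging from -1 to 1
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative toy vectors
v1 = np.array([0.8, 0.3, 0.1])
v2 = np.array([0.7, 0.4, 0.1])
print(cosine_similarity(v1, v2))  # close to 1.0 -> similar contexts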
Run these commands in your terminal to install nltk and gensim:
pip install nltk
pip install gensim
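NLTK's sent_tokenize and word_tokenize additionally rely on the 'punkt' tokenizer models, which are downloaded separately inside Python:

import nltk
nltk.download('punkt')
# on newer NLTK releases you may also need:
# nltk.download('punkt_tab')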
Download the text file used for generating the word vectors from here.
Below is the implementation:
# Python program to generate word vectors using Word2Vec
# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action = 'ignore')
import gensim
from gensim.models import Word2Vec
# Reads 'alice.txt' file
sample = open("C:\\Users\\Admin\\Desktop\\alice.txt", "r")
s = sample.read()
sample.close()

# Replaces escape character with space
f = s.replace("\n", " ")

data = []

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model (CBOW is the default, sg = 0)
# Note: 'size' was renamed to 'vector_size' in gensim 4.0
model1 = gensim.models.Word2Vec(data, min_count = 1,
                                vector_size = 100, window = 5)

# Print results (similarity lives on model.wv in gensim 4.0+)
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model (sg = 1)
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100,
                                window = 5, sg = 1)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))
Output:
Cosine similarity between 'alice' and 'wonderland' - CBOW : 0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW : 0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram : 0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram : 0.856892599521
The output shows the cosine similarity between the word vectors for 'alice', 'wonderland', and 'machines' under the two models. An interesting exercise is to change the values of the 'vector_size' and 'window' parameters and observe the changes in cosine similarity.
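A minimal sketch of that experiment, reusing the data list built above (the parameter grid here is an arbitrary choice):

# Vary vector_size and window and watch the CBOW similarity change
for vs in (50, 100, 200):
    for win in (3, 5, 10):
        m = gensim.models.Word2Vec(data, min_count = 1,
                                   vector_size = vs, window = win)
        print("vector_size =", vs, ", window =", win, ":",
              m.wv.similarity('alice', 'wonderland'))

# The trained models can also be queried for nearest neighbours
print(model2.wv.most_similar('alice', topn = 5))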
Applications of Word Embedding:
- Sentiment Analysis
- Speech Recognition
- Information Retrieval
- Question Answering