Finding Word Analogies from Given Words Using Word2Vec Embeddings

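Many placement exams include a basic question type: word analogies. In a word analogy task, we complete the sentence "a is to b as c is to ___", which is commonly written as a : b :: c : d, and we must find the word d. An example question: "man is to woman as king is to ___".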

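A human can readily see that the blank should be filled with "queen". A machine, however, needs extensive training to learn this pattern and pick the most suitable word. What if we could use a machine learning algorithm to automate the task of finding word analogies? In this tutorial, we use the Word2Vec model with a set of pre-trained vectors named "GoogleNews-vectors-negative300.bin", which Google trained on roughly 100 billion words. Each word in the pre-trained vocabulary is embedded in a 300-dimensional space, and words with similar context or meaning lie closer together in that space.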

Method for finding the analogous word:

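In this problem, our goal is to find a word d such that the associated word vectors va, vb, vc, vd satisfy the relation 'vb - va = vd - vc'. Exact equality rarely holds in practice, so we use cosine similarity to measure how close vb - va and vd - vc are, and choose the word d that maximizes it.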
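The relation above can be illustrated on a tiny example. The sketch below uses hypothetical 3-dimensional vectors invented purely for illustration (they are not taken from the GoogleNews model), and the helper name `cosine` is an assumption of this example. It compares the offset vb - va with the offset vd - vc for two candidate words.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings, made up for this illustration only.
toy = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([3.0, 0.0, 0.9]),
    "queen": np.array([3.0, 1.0, 0.9]),
    "apple": np.array([0.0, 0.5, 3.0]),
}

# Offset vb - va for the pair (man, woman).
offset = toy["woman"] - toy["man"]

# Score each candidate d by cos(vb - va, vd - vc) with c = "king".
scores = {w: cosine(offset, toy[w] - toy["king"]) for w in ("queen", "apple")}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setup the candidate whose offset from "king" points in the same direction as the man-to-woman offset receives the highest score, which is the behaviour the full search relies on.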

Importing the required libraries:

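We need to install the additional gensim library to use the word2vec model. Install it by running "pip install gensim" in a terminal or command prompt. The code below also imports cosine_similarity from scikit-learn.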

Python3

import numpy as np
import gensim
from gensim.models import word2vec,KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

Loading the word vectors from the pre-trained model:

Python3

vector_word_notations = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

Defining a function to predict the analogous word:

Python3

def analogous_word(word_1, word_2, word_3, vector_word_notations):
    ''' The function accepts a triad of words, word_1, word_2, word_3 and returns word_4 such that word_1:word_2::word_3:word_4 '''

    # converting each word to its lowercase
    word_1, word_2, word_3 = word_1.lower(), word_2.lower(), word_3.lower()

    # cosine similarity between (word_2 - word_1) and (word_4 - word_3) should be maximum
    maximum_similarity = -99999

    word_4 = None

    # vocabulary of the model; gensim >= 4.0 exposes it as key_to_index
    # (older releases used vector_word_notations.vocab instead)
    words = vector_word_notations.key_to_index.keys()

    va, vb, vc = vector_word_notations[word_1], \
        vector_word_notations[word_2], vector_word_notations[word_3]

    # search for word_4 such that the similarity between
    # (word_2 - word_1) and (word_4 - word_3) is maximum

    for i in words:
        # skip the three input words themselves
        if i in [word_1, word_2, word_3]:
            continue

        wvec = vector_word_notations[i]
        # cosine_similarity expects 2-D inputs and returns a 1x1 array
        similarity = cosine_similarity([vb - va], [wvec - vc])[0][0]

        if similarity > maximum_similarity:
            maximum_similarity = similarity
            word_4 = i

    return word_4

Testing our model:

Python3

triad_1 = ("Man", "Woman", "King")
# *triad_1 unpacks the elements of the tuple into the first three arguments
output = analogous_word(*triad_1, vector_word_notations)
print(output)

# The output will be shown as queen
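
Looping over every word in Python and calling cosine_similarity once per word can be slow on a vocabulary of this size. As a rough sketch of one alternative, the same search can be written with NumPy array operations. The function name `analogous_word_vectorized` and its `words` / `matrix` arguments are assumptions of this example and not part of the tutorial's code. It is demonstrated here on hypothetical toy vectors, not on the GoogleNews model.

```python
import numpy as np

def analogous_word_vectorized(word_1, word_2, word_3, words, matrix):
    """Return word_4 maximizing cos(v2 - v1, v4 - v3).

    words  : list of vocabulary strings (hypothetical argument)
    matrix : array of shape (len(words), dim); row i is the vector of words[i]
    """
    index = {w: i for i, w in enumerate(words)}
    va, vb, vc = (matrix[index[w]] for w in (word_1, word_2, word_3))

    target = vb - va                 # offset we want to reproduce
    diffs = matrix - vc              # v_w - v_c for every word w at once
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)

    # Cosine similarity per row; rows with zero norm (e.g. word_3 itself)
    # keep the fill value -inf instead of dividing by zero.
    sims = np.divide(diffs @ target, norms,
                     out=np.full(len(words), -np.inf), where=norms > 0)

    # Exclude the three input words from the candidates.
    for w in (word_1, word_2, word_3):
        sims[index[w]] = -np.inf

    return words[int(np.argmax(sims))]

# Hypothetical toy vocabulary, made up for this illustration only.
words = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([
    [1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2],
    [3.0, 0.0, 0.9],
    [3.0, 1.0, 0.9],
    [0.0, 0.5, 3.0],
])

result = analogous_word_vectorized("man", "woman", "king", words, matrix)
print(result)
```

With gensim 4.x, the model's `index_to_key` list and `vectors` array could presumably be passed as `words` and `matrix`, though `matrix - vc` then allocates a temporary array as large as the whole embedding matrix. gensim's KeyedVectors also provides a built-in `most_similar(positive=[...], negative=[...])` method for analogy queries; it ranks candidates by similarity to the combined vector rather than by the offset comparison used in this tutorial, so its results may differ.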