
📅  Last modified: 2022-05-13 01:58:08.779000             🧑  Author: Mango

Finding the Odd Word Out in a Given Set of Words Using Word2Vec Embeddings

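Odd One Out problems are among the most interesting and most common questions used to test a person's logical reasoning. They appear frequently in competitive exams and placement rounds because they check a candidate's analytical skills and decision-making ability. In this article, we will write Python code that finds the odd word in a given set of words.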

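Suppose we have a set of words such as Apple, Mango, Orange, Party, Guava, and we have to find the odd word. As humans, we can quickly see that Party is the odd word, because all the other words are names of fruits, but this is much harder for a model to work out. Here we will use the Word2Vec model with the pre-trained vectors named "GoogleNews-vectors-negative300.bin", which Google trained on about 100 billion words of the Google News dataset. Every word in the pre-trained vocabulary is embedded in a 300-dimensional space, and words with similar context or meaning lie closer together in that space and have a higher cosine similarity.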

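Approach to find the odd word:

We compute the average vector of all the given word vectors, then compare each word vector with that average vector using cosine similarity. The word with the lowest cosine similarity to the average vector is the odd word.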
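The approach can be tried on a tiny example before loading any model. The following is a minimal sketch, not the article's final code: the 3-dimensional vectors in `toy_vectors` are made-up values chosen only for illustration (real Word2Vec vectors have 300 dimensions), and the helper names `cosine` and `mean_vector` are hypothetical. It uses only the standard library so that it runs without any download.

```python
import math

# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "mango": [0.8, 0.2, 0.1],
    "guava": [0.85, 0.15, 0.05],
    "party": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

center = mean_vector(list(toy_vectors.values()))
scores = {word: cosine(vec, center) for word, vec in toy_vectors.items()}

# The word least similar to the average vector is the odd one out
odd_word = min(scores, key=scores.get)
print(odd_word)
```

The three fruit vectors point in nearly the same direction, so they dominate the average, and the vector for "party" ends up with the lowest score.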

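Importing the required libraries:

We need to install the additional gensim library to use the Word2Vec model. Install it by running the command "pip install gensim" in your terminal/command prompt.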

Python3
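# NumPy for averaging the vectors
import numpy as np
# KeyedVectors loads and stores the pre-trained word vectors
from gensim.models import KeyedVectors
# Cosine similarity between two sets of vectors
from sklearn.metrics.pairwise import cosine_similarity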





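Loading the word vectors using the pre-trained model:

Python3

# Load the pre-trained vectors; the .bin file is expected in the working directory
vector_word_notations = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)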


Defining a function to predict the odd word:

Python3



def odd_word_out(input_words):
    '''Accept a list of words, print each similarity score and return the odd word.'''

    # Look up the embedding of every word in the given list
    whole_word_vectors = [vector_word_notations[i] for i in input_words]

    # Average vector of all word vectors
    mean_vector = np.mean(whole_word_vectors, axis=0)

    # Iterate over every word and track the lowest similarity to the mean
    odd_word = None
    minimum_similarity = float('inf')

    for i in input_words:
        # cosine_similarity returns a 1x1 matrix here; take its single entry
        similarity = cosine_similarity([vector_word_notations[i]], [mean_vector])[0][0]
        if similarity < minimum_similarity:
            minimum_similarity = similarity
            odd_word = i

        print("cosine similarity score between %s and mean_vector is %.3f" % (i, similarity))

    print("\nThe odd word is: " + odd_word)
    return odd_word
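
The lookup `vector_word_notations[i]` raises a `KeyError` when a word is missing from the model's vocabulary. As a hedged sketch (not part of the original article), the hypothetical helper `odd_word_out_safe` below skips unknown words and returns its result instead of printing it. It assumes `vectors` is any mapping from word to vector that supports `in` and indexing, as gensim's `KeyedVectors` does; a small dictionary of made-up 3-dimensional vectors stands in for the real model so the example runs without the download.

```python
import numpy as np

def odd_word_out_safe(input_words, vectors):
    """Return (odd_word, skipped_words); words missing from `vectors` are skipped."""
    known = [w for w in input_words if w in vectors]
    skipped = [w for w in input_words if w not in vectors]
    if len(known) < 3:
        raise ValueError("need at least three known words to pick an odd one")

    matrix = np.array([vectors[w] for w in known], dtype=float)
    mean_vector = matrix.mean(axis=0)

    # Cosine similarity of every row with the mean vector, computed in one step
    similarities = matrix @ mean_vector / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(mean_vector)
    )
    return known[int(np.argmin(similarities))], skipped

# Stand-in for the real model: made-up vectors, for illustration only
toy_model = {
    "apple": np.array([0.9, 0.1, 0.0]),
    "mango": np.array([0.8, 0.2, 0.1]),
    "guava": np.array([0.85, 0.15, 0.05]),
    "party": np.array([0.0, 0.2, 0.9]),
}

result = odd_word_out_safe(["apple", "mango", "qwzx", "party", "guava"], toy_model)
print(result)
```

With the real model loaded, the same helper could be called as `odd_word_out_safe(input_1, vector_word_notations)`; the second element of the result lists any words that were ignored.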


Testing our model:

Python3

input_1 = ['apple', 'mango', 'juice', 'party', 'orange', 'guava']  # party is the odd word
odd_word_out(input_1)

Output:

cosine similarity score between apple and mean_vector is 0.765
cosine similarity score between mango and mean_vector is 0.808
cosine similarity score between juice and mean_vector is 0.688
cosine similarity score between party and mean_vector is 0.289
cosine similarity score between orange and mean_vector is 0.611
cosine similarity score between guava and mean_vector is 0.790

The odd word is: party

Similarly, let us take another example:

Python3

input_2 = ['India', 'paris', 'Russia', 'France', 'Germany', 'USA']
# paris is the odd word: it is a capital city, while the others are countries
odd_word_out(input_2)

Output:

cosine similarity score between India and mean_vector is 0.660
cosine similarity score between paris and mean_vector is 0.518
cosine similarity score between Russia and mean_vector is 0.691
cosine similarity score between France and mean_vector is 0.758
cosine similarity score between Germany and mean_vector is 0.763
cosine similarity score between USA and mean_vector is 0.564

The odd word is: paris