Finding the Odd Word in a Given Set of Words Using Word2Vec Embeddings
The Odd One Out problem is one of the most interesting and most frequently asked questions for testing a person's logical reasoning. It appears often in competitive exams and placement rounds because it checks analytical skills and decision-making ability. In this article, we will write Python code that finds the odd word in a given set of words.
Suppose we have a set of words such as Apple, Mango, Orange, Party, Guava, and we have to find the odd one out. As humans we can quickly see that Party is the odd word, since all the other words are names of fruits, but it is much harder for a model to work this out. Here we will use the Word2Vec approach with the pre-trained model "GoogleNews-vectors-negative300.bin", trained by Google on roughly 100 billion words of Google News data. Each word in the pre-trained vocabulary is embedded in a 300-dimensional space, and words with similar context/meaning lie closer together in that space and have a higher cosine-similarity value.
Approach for finding the odd word:
We compute the average (mean) vector of all the given word vectors, then compare the cosine similarity of each word vector with this mean vector; the word with the lowest cosine similarity is our odd word.
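To make the comparison step concrete, here is a minimal sketch of cosine similarity computed directly with NumPy. The 3-dimensional toy vectors are purely hypothetical stand-ins for the 300-dimensional Word2Vec embeddings used later in the article.
Python3
import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional vectors (hypothetical, just for illustration)
fruit_1 = np.array([0.9, 0.1, 0.0])
fruit_2 = np.array([0.8, 0.2, 0.1])
event = np.array([0.1, 0.9, 0.4])

mean_vector = np.mean([fruit_1, fruit_2, event], axis=0)
for name, vec in [('fruit_1', fruit_1), ('fruit_2', fruit_2), ('event', event)]:
    print(name, round(cosine_sim(vec, mean_vector), 3))
# The vector least similar to the mean would be flagged as the odd one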
Importing the required libraries:
We need to install the additional gensim library in order to use the word2vec model. Install gensim from a terminal/command prompt with the command "pip install gensim".
Python3
import numpy as np
import gensim
# KeyedVectors is used to load the pre-trained word vectors
from gensim.models import word2vec, KeyedVectors
# cosine_similarity compares each word vector with the mean vector
from sklearn.metrics.pairwise import cosine_similarity
Loading the word vectors from the pre-trained model:
Python3
vector_word_notations = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)
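The binary file is several gigabytes, so loading it can be slow and memory-hungry. As a rough sketch, the limit argument of load_word2vec_format can restrict loading to the most frequent vectors (500000 below is an arbitrary choice), and gensim's downloader API can fetch the same vectors by name instead of using a local file; adjust either option to your environment.
Python3
# Optional: load only the 500,000 most frequent vectors to save memory
vector_word_notations = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)

# Optional alternative: download the vectors through gensim
# (requires an internet connection on first use)
# import gensim.downloader as api
# vector_word_notations = api.load('word2vec-google-news-300')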
Defining a function to predict the odd word:
Python3
def odd_word_out(input_words):
    '''The function accepts a list of words and returns the odd word.'''
    # Generate the word embedding for every word in the given list
    whole_word_vectors = [vector_word_notations[i] for i in input_words]
    # Average vector of all the word vectors
    mean_vector = np.mean(whole_word_vectors, axis=0)
    # Iterate over every word and compare it with the mean vector
    odd_word = None
    minimum_similarity = 99999.0  # Can be any very high value
    for i in input_words:
        # cosine_similarity returns a 1x1 array; extract the scalar value
        similarity = cosine_similarity([vector_word_notations[i]], [mean_vector])[0][0]
        if similarity < minimum_similarity:
            minimum_similarity = similarity
            odd_word = i
        print("cosine similarity score between %s and mean_vector is %.3f" % (i, similarity))
    print("\nThe odd word is: " + odd_word)
Testing our model:
Python3
input_1 = ['apple', 'mango', 'juice', 'party', 'orange', 'guava']  # party is the odd word
odd_word_out(input_1)
Output:
cosine similarity score between apple and mean_vector is 0.765
cosine similarity score between mango and mean_vector is 0.808
cosine similarity score between juice and mean_vector is 0.688
cosine similarity score between party and mean_vector is 0.289
cosine similarity score between orange and mean_vector is 0.611
cosine similarity score between guava and mean_vector is 0.790
The odd word is: party
Similarly, let us take one more example:
Python3
input_2 = ['India', 'paris', 'Russia', 'France', 'Germany', 'USA']
# paris is the odd word since it is a capital city and the others are countries
odd_word_out(input_2)
Output:
cosine similarity score between India and mean_vector is 0.660
cosine similarity score between paris and mean_vector is 0.518
cosine similarity score between Russia and mean_vector is 0.691
cosine similarity score between France and mean_vector is 0.758
cosine similarity score between Germany and mean_vector is 0.763
cosine similarity score between USA and mean_vector is 0.564
The odd word is: paris
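As a cross-check, gensim's KeyedVectors also exposes a built-in doesnt_match method, which is based on the same mean-vector idea and should normally agree with our hand-written function:
Python3
# Built-in gensim alternative; expected to agree with odd_word_out above
print(vector_word_notations.doesnt_match(
    ['apple', 'mango', 'juice', 'party', 'orange', 'guava']))
# Likely prints: party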