Word2Vec is a model for generating and training word vectors (word embeddings). A word vector represents a word as a numeric vector so that a computer can recognize and work with words. Word2Vec learns these vectors from the meanings of words and the relationships between them, and the resulting vectors can be used in a variety of natural language processing tasks such as text classification, clustering, and retrieval.
This article shows how to implement the skip-gram variant of the Word2Vec model in Python.
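For reference, the standard skip-gram objective maximizes the average log-probability of each context word given its center word, with the conditional probability defined by a softmax over the vocabulary; the toy implementation below approximates this idea with a simpler squared-error loss:

\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}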
Before we begin, make sure Python 3 is available along with the nltk, numpy, and matplotlib packages, plus a small piece of text to train on.
Here is some Python code for preprocessing our text data; the vocabulary will then be built from the resulting word frequencies.
import nltk
nltk.download('punkt')
from collections import Counter
from nltk.tokenize import word_tokenize
def preprocess(text):
    # Lowercase the text, tokenize it, and count how often each token appears
    tokens = word_tokenize(text.lower())
    word_counts = Counter(tokens)
    vocab = set(tokens)
    return tokens, word_counts, vocab
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."
tokens, word_counts, vocab = preprocess(text)
print(tokens) # ['it', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', ',', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.']
print(word_counts) # Counter({'a': 4, ',': 2, 'in': 2, 'of': 2, 'it': 1, 'is': 1, ...}); every remaining word appears once
print(vocab) # the set of all 20 unique tokens, printed in arbitrary order, e.g. {'wife', 'single', 'a', 'of', 'in', ...}
# Build the vocabulary from the word frequencies
vocab_size = 10000
vocab = sorted(word_counts, key=word_counts.get, reverse=True)[:vocab_size]
word2id = {w:i for i, w in enumerate(vocab)}
id2word = {i:w for i, w in enumerate(vocab)}
print(vocab[:20]) # ['a', ',', 'in', 'of', 'it', 'is', 'truth', 'universally', 'acknowledged', 'that', 'single', 'man', 'possession', 'good', 'fortune', 'must', 'be', 'want', 'wife', '.']
print(word2id['a']) # 0
print(id2word[1]) # ','
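One caveat: on a larger corpus with more unique words than vocab_size, some tokens will fall outside the truncated vocabulary, and the word2id lookups below would raise a KeyError. A minimal precaution (not needed for our tiny example, where all 20 unique tokens fit) is to filter those tokens out first:

# Keep only tokens that made it into the truncated vocabulary
tokens = [t for t in tokens if t in word2id]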
The following Python example code organizes the data into training pairs, matching each center word with the words in its surrounding context window:
import numpy as np
window_size = 2 # number of context words on each side of the center word
word_pairs = []
for i in range(window_size, len(tokens) - window_size):
    center_word = tokens[i]
    context_words = tokens[i - window_size:i] + tokens[i+1:i+window_size+1]
    context_word_ids = [word2id[word] for word in context_words]
    center_word_id = word2id[center_word]
    word_pairs.append((center_word_id, context_word_ids))
print(word_pairs[:5]) # [(0, [4, 5, 6, 7]), (6, [5, 0, 7, 8]), (7, [0, 6, 8, 1]), (8, [6, 7, 1, 9]), (1, [7, 8, 9, 0])], i.e. (center_word_id, [context_word_ids]) pairs
We define our model with the following Python code.
# Define the network structure: an input-to-hidden matrix W1 and a hidden-to-output matrix W2
embedding_size = 50
W1 = np.random.rand(len(vocab), embedding_size)
W2 = np.random.rand(embedding_size, len(vocab))
# Define the forward pass (from the one-hot input vector to the output scores)
def forward(inputs):
    hidden = np.dot(W1.T, inputs)   # project the input into the embedding space
    output = np.dot(W2.T, hidden)   # one score per vocabulary word
    return output, hidden
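As a quick sanity check (not part of the original code), feeding a one-hot vector through forward should produce a hidden vector of length embedding_size and one output score per vocabulary word:

x = np.zeros(len(vocab))
x[word2id['man']] = 1
out, hid = forward(x)
print(hid.shape, out.shape) # (50,) (20,) for our sample sentence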
# Define the backward pass: propagate the error back and update the weights
def backward(output, hidden, inputs, target, learning_rate):
    global W1, W2
    # Error at the output layer
    d_output = output - target
    dW2 = np.outer(hidden, d_output)
    # Error propagated back to the hidden layer
    d_hidden = np.dot(W2, d_output)
    dW1 = np.outer(inputs, d_hidden)
    # Gradient-descent update of both weight matrices
    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2
    return d_hidden
The following Python code trains the word vectors with this skip-gram setup.
learning_rate = 0.01
epochs = 100
losses = []
for epoch in range(epochs):
    loss = 0.0
    for center_word, context_words in word_pairs:
        # One-hot vector for the center word
        inputs = np.zeros(len(vocab))
        inputs[center_word] = 1
        # Multi-hot target vector covering all of its context words
        target = np.zeros(len(vocab))
        for ctx_word in context_words:
            target[ctx_word] = 1
        output, hidden = forward(inputs)
        loss += np.sum((output - target) ** 2)
        backward(output, hidden, inputs, target, learning_rate)
    losses.append(loss)
    if epoch % 10 == 0:
        print("Epoch %d, loss=%f" % (epoch, loss))
import matplotlib.pyplot as plt
plt.plot(range(epochs), losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()
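The loop above minimizes a squared-error loss for simplicity. The standard skip-gram formulation instead applies a softmax to the output scores and minimizes cross-entropy against each context word. Below is a minimal sketch of one such update step; the helper names softmax and skipgram_step are our own additions and not part of the original tutorial:

def softmax(x):
    # Numerically stable softmax over the vocabulary scores
    e = np.exp(x - np.max(x))
    return e / e.sum()

def skipgram_step(center_id, context_ids, lr=0.01):
    # One stochastic-gradient step for a single (center word, context words) pair
    global W1, W2
    v = W1[center_id]                  # hidden layer = the center word's embedding row
    scores = np.dot(W2.T, v)           # one score per vocabulary word
    probs = softmax(scores)
    # Gradient of the summed cross-entropy loss with respect to the scores
    grad_scores = len(context_ids) * probs
    for c in context_ids:
        grad_scores[c] -= 1
    grad_v = np.dot(W2, grad_scores)   # gradient with respect to the center embedding
    W2 -= lr * np.outer(v, grad_scores)
    W1[center_id] -= lr * grad_v

Calling skipgram_step(center_word, context_words) inside the same loop over word_pairs would replace the forward/backward calls above.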
The code below uses the trained model to obtain word vectors and look up similar words.
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def get_most_similar(word, word2id, id2word, W1):
    # Compare the query word's embedding with every other word's embedding
    word_id = word2id[word]
    word_vec = W1[word_id]
    similarities = []
    for i in range(len(vocab)):
        if i != word_id:
            similarity = cosine_similarity(word_vec, W1[i])
            similarities.append((id2word[i], similarity))
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
print(get_most_similar('man', word2id, id2word, W1)) # e.g. [('good', 0.56), ('a', 0.51), ('wife', 0.29), ...]; the exact words and scores depend on the random initialization and training run
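After training, the embedding for any vocabulary word is simply the corresponding row of W1, for example:

man_vec = W1[word2id['man']]
print(man_vec.shape) # (50,)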
That completes our walkthrough of implementing the skip-gram Word2Vec model in Python. We first preprocessed the text and converted it into (center word, context words) pairs, then defined the network structure along with the forward and backward passes, and finally trained the model. We also showed how to use the trained vectors to find the most similar words.