Word Embedding Using the Universal Sentence Encoder in Python

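Unlike word embedding techniques, which represent individual words as vectors, sentence embeddings map an entire sentence or text, along with its semantic information, to a vector of real numbers. This lets a model capture the useful information in the whole text and so better understand the context or meaning of the sentence.

In this article, you will learn how to create a vector for a complete sentence using the Universal Sentence Encoder.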

For example:

Consider two sentences:

How old are you?
What is your age?

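The two sentences above have similar meanings: both ask for the person's age. Looking at the individual words and their vectors gives little insight into what each complete sentence is trying to convey, and does not tell us whether the two sentences are similar. In cases like this, sentence embeddings perform better than word embeddings.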
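A rough way to see the problem is to compare sentences by the words they share. The sketch below uses Jaccard word overlap, which is a simpler stand-in for word-level comparison and not a word embedding method. The helper name `word_overlap` is made up for this illustration.

```python
def word_overlap(sentence_a, sentence_b):
    """Jaccard overlap between the lowercase word sets of two sentences."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Same meaning, but no words in common.
paraphrase_score = word_overlap("How old are you", "What is your age")

# Different meanings, but "I" and "watch" appear in both.
unrelated_score = word_overlap("I love to watch Television",
                               "I am wearing a wrist watch")

print(paraphrase_score, unrelated_score)
```

Under this measure the paraphrase pair scores zero while the unrelated pair scores above zero, which is the opposite of what a similarity measure should report. A representation of the whole sentence is meant to avoid this kind of mismatch.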

There are various sentence embedding techniques, such as Doc2Vec, SentenceBERT, and the Universal Sentence Encoder.

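Universal Sentence Encoder

The Universal Sentence Encoder encodes an entire sentence or text into a vector of real numbers that can be used for clustering, sentence similarity, text classification, and other natural language processing (NLP) tasks. The pre-trained model is available under the Apache-2.0 license. It is trained on text longer than a single word (sentences, phrases, paragraphs, and so on) using a deep averaging network (DAN) encoder.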
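The general idea behind a deep averaging network can be sketched in a few lines: average the vectors of the input tokens, then pass that average through a stack of feed-forward layers. The code below is a toy illustration only, not the actual Universal Sentence Encoder architecture. The function names, the 4-dimensional vectors, the tanh activation, and the random weights are all assumptions chosen to keep the example small.

```python
import math
import random


def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def dense_tanh(vector, weights, biases):
    """One fully connected layer followed by a tanh activation."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, biases)
    ]


def toy_dan_encode(tokens, word_vectors, layers):
    """Average the token vectors, then pass the mean through each layer."""
    hidden = average_vectors([word_vectors[token] for token in tokens])
    for weights, biases in layers:
        hidden = dense_tanh(hidden, weights, biases)
    return hidden


# Hypothetical toy setup: random 4-dimensional word vectors and two
# random layers. A real encoder learns these values during training.
rng = random.Random(0)
vocab = ["how", "old", "are", "you"]
word_vectors = {w: [rng.uniform(-1, 1) for _ in range(4)] for w in vocab}
layers = [
    ([[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)], [0.0] * 4),
    ([[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)], [0.0] * 3),
]

sentence_vector = toy_dan_encode(["how", "old", "are", "you"], word_vectors, layers)
print(len(sentence_vector))  # length is set by the last layer, not the token count
```

Because the token vectors are averaged before the layers are applied, the output has the same length for a two-word phrase as for a long paragraph, which is what makes a fixed-size sentence vector possible.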

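Implementing sentence embeddings with the Universal Sentence Encoder:

Before running the code, run these commands in a terminal to install the required libraries.

pip install "tensorflow>=2.0.0"

pip install --upgrade tensorflow-hub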

Program:

Python3

# Import the necessary library
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Sentences to embed, passed to embed() as a list
sentences = [
    "How old are you",
    "What is your age",
    "I love to watch Television",
    "I am wearing a wrist watch"
]
embeddings = embed(sentences)

# Print the embeddings of all sentences
print(embeddings)

# Print each sentence followed by its embedding
for sentence, embedding in zip(sentences, embeddings):
    print(sentence)
    print(embedding)

Output:

tf.Tensor(
[[-0.06045125 -0.00204541  0.02656925 ...  0.00764413 -0.02669661
   0.05110302]
 [-0.08415682 -0.08687923  0.03446117 ... -0.01439389 -0.04546221
   0.03639965]
 [ 0.0816019  -0.01570276 -0.05659245 ... -0.07133699  0.11040762
  -0.0071095 ]
 [-0.00369539  0.03064634 -0.05556112 ...  0.01751423  0.0316496
  -0.05139377]], shape=(4, 512), dtype=float32)

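Explanation:

The output above shows the Universal Sentence Encoder representing each of the four input sentences as its own 512-dimensional vector, as indicated by shape=(4, 512).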
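Once sentences are vectors, one common way to compare them is cosine similarity. The sketch below is a plain-Python illustration rather than part of the original program. The helper names `cosine_similarity` and `similarity_matrix` are made up here, and the 3-dimensional toy vectors are stand-ins for real 512-dimensional embeddings.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities for a list of vectors."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Hypothetical 3-dimensional stand-ins for real sentence embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first vector
    [0.0, 1.0, 0.0],  # orthogonal to the first vector
]
matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(value, 3) for value in row])
```

With the embeddings from the program above, passing `embeddings.numpy().tolist()` to `similarity_matrix` should give a 4 x 4 matrix in which higher values indicate sentence pairs the model considers closer in meaning.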