Bag of Words (BoW) Model in NLP
In this article, we will discuss a natural language processing technique for text modeling called the Bag of Words (BoW) model. Whenever we apply an algorithm in NLP, it works on numbers; we cannot feed raw text into the algorithm directly. The Bag of Words model is therefore used to preprocess the text by converting it into a "bag" of its words, recording the total number of occurrences of the most frequently used words.
The model can be visualized as a table that maps each word to its count.
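To make that word-count table concrete, here is a minimal sketch on two toy sentences. It uses Python's standard-library collections.Counter, which is not part of the model itself, just a convenient way to tally words:

# A toy word-count "table" for two short sentences
from collections import Counter

toy_sentences = ["the cat sat on the mat", "the dog sat"]
counts = Counter(word for sentence in toy_sentences for word in sentence.split())
print(counts)  # Counter({'the': 3, 'sat': 2, 'cat': 1, 'on': 1, 'mat': 1, 'dog': 1})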
Applying the Bag of Words model:
Let us use the following example paragraph for our task:
Beans. I was trying to explain to somebody as we were flying in, that’s corn. That’s beans. And they were very impressed at my agricultural knowledge. Please give it up for Amaury once again for that outstanding introduction. I have a bunch of good friends here today, including somebody who I served with, who is one of the finest senators in the country, and we’re lucky to have him, your Senator, Dick Durbin is here. I also noticed, by the way, former Governor Edgar here, who I haven’t seen in a long time, and somehow he has not aged and I have. And it’s great to see you, Governor. I want to thank President Killeen and everybody at the U of I System for making it possible for me to be here today. And I am deeply honored at the Paul Douglas Award that is being given to me. He is somebody who set the path for so much outstanding public service here in Illinois. Now, I want to start by addressing the elephant in the room. I know people are still wondering why I didn’t speak at the commencement.
Step #1: We will first preprocess the data in order to:
- Convert the text to lowercase.
- Remove all non-word characters.
- Remove all punctuation.
# Python3 code for preprocessing text
import nltk
import re
import numpy as np

# nltk.download('punkt')  # uncomment if the sentence tokenizer data is not yet installed
# place the text to be processed here, e.g.:
# text = """ # place text here """

# Split the paragraph into sentences, then clean each sentence
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()               # convert to lowercase
    dataset[i] = re.sub(r'\W', ' ', dataset[i])   # replace non-word characters with spaces
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse repeated whitespace
Output: dataset now holds the lowercased, punctuation-free sentences of the paragraph.
You can preprocess the text further to suit your needs, as in the sketch below.
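For example, a common extra step is removing stopwords. A minimal sketch, assuming NLTK's English stopword list has been downloaded (nltk.download('stopwords')), might look like this:

# Optional: remove common English stopwords from each sentence
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
for i in range(len(dataset)):
    dataset[i] = ' '.join(word for word in dataset[i].split()
                          if word not in stop_words)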
Step #2: Obtaining the most frequent words in our text.
We will apply the following steps to build our model.
# Creating the Bag of Words model
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)   # tokenize each sentence into words
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
Output: word2count now maps each distinct word to the number of times it occurs.
In our model, we have a total of 118 words. However, when processing large texts, the number of words can reach millions. We do not need to use all of them, so we select a specific number of the most frequently used words. To implement this, we use:
import heapq
# Keep only the 100 most frequent words
freq_words = heapq.nlargest(100, word2count, key=word2count.get)
where 100 denotes the number of words we want. If our text is large, we would feed in a larger number.
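As an aside, the same selection can also be made with the standard library's collections.Counter. This is only an equivalent sketch, not part of the original recipe:

# Equivalent top-100 selection using collections.Counter
from collections import Counter
freq_words = [word for word, count in Counter(word2count).most_common(100)]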
Step #3: Building the Bag of Words model
In this step, we construct a vector that tells us whether each word in a sentence is one of the frequent words or not. If a word in the sentence is a frequent word, we set the corresponding entry to 1; otherwise, we set it to 0.
This can be implemented with the following code:
# Build the document vectors: one row per sentence, one column per frequent word
X = []
for data in dataset:
    words = nltk.word_tokenize(data)   # tokenize the sentence once
    vector = []
    for word in freq_words:
        if word in words:
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)
X = np.asarray(X)
Output: X is a binary matrix with one row per sentence and one column per frequent word.
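For comparison, a similar bag-of-words matrix can be produced with scikit-learn's CountVectorizer. This is only a sketch of an alternative, assuming scikit-learn (a recent version) is installed; with binary=True it records presence/absence, and without it the matrix holds raw word counts instead:

# Alternative: bag-of-words matrix with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=100, binary=True)  # top 100 words, presence/absence
X_sklearn = vectorizer.fit_transform(dataset).toarray()
print(vectorizer.get_feature_names_out())  # the words that make up the columns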