How to get the top 100 most frequent words in a Python DataFrame column

📅 Last modified: 2023-12-03 15:38:24.434000 🧑 Author: Mango

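Finding the most frequent words is a common task when working with text data. This article shows how to get the top 100 most frequent words in one column of a pandas DataFrame using Python.

Preparing the data

First, we need some data to demonstrate the approach. We will use a DataFrame with a single column of strings: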

import pandas as pd

data = pd.DataFrame({
    'text': ['this is a sample text', 'this is another sample text', 'yet another example of text']
})

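Tokenization

The first step is to split the text into words. Python's split() method splits each string on whitespace. We also convert every word to lowercase so that differently cased forms of a word are counted together, and collect all the words in one list: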

words = []

for text in data['text']:
    for word in text.lower().split():
        words.append(word)

print(words)

This prints the following:

['this', 'is', 'a', 'sample', 'text', 'this', 'is', 'another', 'sample', 'text', 'yet', 'another', 'example', 'of', 'text']
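
Note that split() only splits on whitespace, so punctuation stays attached to words and 'text.' would be counted separately from 'text'. The sample data has no punctuation, but real text usually does. The following is a minimal sketch of one way to handle that, assuming that runs of word characters are an acceptable definition of a "word". The `sample` data and the `tokenize` helper are hypothetical and are not part of the original example:

```python
import re

import pandas as pd

# Hypothetical sample with punctuation and mixed case (not the article's data).
sample = pd.DataFrame({
    'text': ['This is a sample text.', 'Another sample, another text!']
})

def tokenize(text):
    """Lowercase the text and return runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

tokens = []
for text in sample['text']:
    tokens.extend(tokenize(text))

print(tokens)
```

With this pattern, a contraction such as "don't" is split into "don" and "t", so the regular expression may need adjusting for the text at hand.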

Counting

Next, we count how many times each word occurs in the dataset. A simple way to do this is the Counter class from Python's collections module:

from collections import Counter

word_counts = Counter(words)

print(word_counts)

This prints the following:

Counter({'text': 3, 'this': 2, 'is': 2, 'sample': 2, 'another': 2, 'a': 1, 'yet': 1, 'example': 1, 'of': 1})
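
If only the words and their counts are needed, Counter can already return them in order through its most_common() method, without going through a DataFrame. The sketch below repeats the word list from the tokenization step so that it runs on its own; the variable names `top_pairs` and `top_words_only` are chosen here for illustration:

```python
from collections import Counter

# The word list produced by the tokenization step.
words = ['this', 'is', 'a', 'sample', 'text', 'this', 'is', 'another',
         'sample', 'text', 'yet', 'another', 'example', 'of', 'text']

word_counts = Counter(words)

# most_common(n) returns up to n (word, count) pairs, highest count first.
# Words with equal counts keep the order in which they were first seen.
top_pairs = word_counts.most_common(100)

# Keep only the words, dropping the counts.
top_words_only = [word for word, _ in top_pairs]

print(top_pairs[:3])
print(top_words_only)
```

Because the sample has fewer than 100 distinct words, most_common(100) simply returns all of them.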

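Getting the most frequent words

Now we can use a pandas DataFrame to get the 100 words with the highest counts. The following code converts the Counter into a DataFrame with one row per word, sorts it by count in descending order with sort_values(), and takes the first 100 rows: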

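# One row per word (the index), with its count in a single 'count' column.
word_counts_df = pd.DataFrame.from_dict(word_counts, orient='index', columns=['count'])
# A stable sort keeps words with equal counts in first-seen order.
word_counts_df = word_counts_df.sort_values('count', ascending=False, kind='mergesort')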

top_words = word_counts_df.head(100).index.tolist()

print(top_words)

This prints the following:

['text', 'this', 'is', 'sample', 'another', 'a', 'yet', 'example', 'of']

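The sample data contains only 9 distinct words, so head(100) returns all of them. On a larger dataset, the same code returns the 100 most frequent words.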
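
The same result can also be obtained entirely within pandas, using the string accessor, explode(), and value_counts(). This is a sketch of an alternative, not the method used above; the helper name `most_frequent_words` is hypothetical, and it assumes the column holds strings (missing values are dropped first):

```python
import pandas as pd

data = pd.DataFrame({
    'text': ['this is a sample text', 'this is another sample text', 'yet another example of text']
})

def most_frequent_words(series, n=100):
    """Return a Series of the n most frequent whitespace-separated words and their counts."""
    # Lowercase, split each string into a list of words, then put one word per row.
    tokens = series.dropna().str.lower().str.split().explode()
    # value_counts() counts each distinct word and sorts by count, highest first.
    return tokens.value_counts().head(n)

counts = most_frequent_words(data['text'], n=100)

print(counts)
print(counts.index.tolist())
```

The order of words that share the same count may differ from the Counter-based version, so it is safer not to rely on how ties are ordered.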

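Conclusion

This article showed how to get the top 100 most frequent words in a column of a pandas DataFrame using Python. We used the split() method to break each string into words, the Counter class from the collections module to count how often each word occurs, and the DataFrame's sort_values() method to order the words by frequency from highest to lowest before taking the first 100 with head().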