Python | Text Summarizer
These days, every kind of organization, whether in online retail, government and private-sector services, or the catering and tourism industries, cares about its customers and asks for feedback each time we use its services. Consider that such companies may receive enormous volumes of user feedback every single day, and it would be very tedious for the management to sit and analyze all of it.

Technology, however, has reached a point where machines can take on many human tasks, and the field making this possible is machine learning. Machines have become able to understand human language using Natural Language Processing, and text analytics is an active area of research today.

One such application of text analytics and NLP is a feedback summarizer, which helps condense and shorten the text of user feedback. An algorithm reduces the body of the text while preserving its original meaning, giving a quick insight into the original text.

If you are interested in data analytics, you will find learning about Natural Language Processing very useful. Python provides immense library support for NLP. We will use NLTK, the Natural Language Toolkit, which will serve our purpose well.
Install the NLTK module on your system using the following command:
sudo pip install nltk
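NLTK's tokenizers and stop-word lists rely on data packages that are downloaded separately from the module itself; on a fresh installation, a one-time download along these lines is typically needed before the code in this article will run:

import nltk
nltk.download('punkt')       # models used by the sentence and word tokenizers
nltk.download('stopwords')   # the stop-word corpus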
Let us understand the steps:

Step 1: Import the required libraries

Two NLTK libraries are necessary for building an efficient feedback summarizer.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
Terms used:

- Corpus
A corpus is a collection of text. It can be any dataset containing text, such as the poems of a particular poet or the works of a particular author. Here we will use a dataset of predetermined stop words.

- Tokenizers
A tokenizer divides text into a series of tokens. There are three main tokenizers: word, sentence, and regex tokenizers. We will use only the word and sentence tokenizers, as the short example below shows.
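A quick illustration of the two tokenizers (a minimal sketch; the sample string is invented for demonstration):

from nltk.tokenize import word_tokenize, sent_tokenize

sample = "NLTK is a useful toolkit. It splits text into tokens."
print(sent_tokenize(sample))
# ['NLTK is a useful toolkit.', 'It splits text into tokens.']
print(word_tokenize(sample))
# ['NLTK', 'is', 'a', 'useful', 'toolkit', '.', 'It', 'splits', 'text', 'into', 'tokens', '.']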
Step 2: Remove stop words and store them in a separate array of words.

Stop words
Any word like (is, a, an, the, for) that does not add to the meaning of a sentence. For example, suppose we have the sentence
GeeksForGeeks is one of the most useful websites for competitive programming.
After removing the stop words, we can narrow down the number of words and preserve the meaning as follows:
['GeeksForGeeks', 'one', 'useful', 'websites', 'competitive', 'programming', '.']
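A minimal sketch of how this filtering can be done with NLTK (the variable names are illustrative; note that plain tokenization keeps words exactly as they appear, with no stemming):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "GeeksForGeeks is one of the most useful websites for competitive programming."
stopWords = set(stopwords.words("english"))
# keep only the tokens that are not stop words
filtered = [word for word in word_tokenize(sentence) if word.lower() not in stopWords]
print(filtered)
# ['GeeksForGeeks', 'one', 'useful', 'websites', 'competitive', 'programming', '.']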
Step 3: Create a word frequency table

We build a Python dictionary that records how many times each word appears in the feedback after the stop words are removed. We can then use this table over every sentence to learn which sentences carry the most relevant content in the overall text.
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()
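The table can then be filled with a counting loop like the one below (the same logic appears in the complete program later in this article):

for word in words:
    word = word.lower()
    # skip stop words so they do not influence the scores
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1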
Step 4: Assign a score to each sentence depending on the words it contains and the frequency table

We can use the sent_tokenize() method to create an array of sentences. Second, we need a dictionary to keep the score of each sentence; we will later go through the dictionary to generate the summary.
sentences = sent_tokenize(text)
sentenceValue = dict()
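Each sentence then earns the frequency of every table word it contains, which can be written as follows (this mirrors the scoring pass in the complete program below):

for sentence in sentences:
    for word, freq in freqTable.items():
        # add the word's frequency to the score of every sentence containing it
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq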
Step 5: Assign a score threshold to compare the sentences within the feedback.

A simple approach to comparing our scores is to find the average score of a sentence. The average itself can be a good threshold.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
average = int(sumValues / len(sentenceValue))
Apply the threshold value and store the sentences in order into the summary, as sketched below.
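A sketch of this final selection step (the 1.2 multiplier is the threshold used in the complete program below; it is a tunable choice, not a fixed rule):

summary = ''
for sentence in sentences:
    # keep only sentences scoring well above the average
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence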
Code: Complete implementation of the Text Summarizer using Python
# importing libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Input text - to summarize (paste the text shown under "Input:" below)
text = """ """

# Tokenizing the text
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Creating a frequency table to keep the
# score of each word
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# Creating a dictionary to keep the score
# of each sentence
sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

# Storing sentences into our summary.
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

print(summary)
Input:
There are many techniques available to generate extractive summarization to keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of the code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Its measures cosine of the angle between vectors. The angle will be 0 if sentences are similar.
Output:
There are many techniques available to generate extractive summarization. Summarization can be defined as a task of producing a concise and fluent summary while preserving key information and overall meaning. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.