📜  Python | Extractive Text Summarization using Gensim

📅  Last modified: 2021-04-17 02:54:58   🧑  Author: Mango

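Summarization is a useful tool for many text applications: it aims to highlight the important information in a large corpus. With the explosion of information on the web, Python provides some handy tools to help summarize text. There are two main approaches to summarization. Extractive methods select the most important sentences from the source verbatim, while abstractive methods generate new sentences that paraphrase it. This article walks through a working example of extractive summarization.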

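Algorithm:
The gensim library implements an algorithm called "TextRank", which is based on PageRank, the algorithm used to rank search results. It works as follows: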

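  1. Preprocess the given text. This includes removing stop words, removing punctuation, and stemming.
  2. Build a graph with the sentences as vertices.
  3. Each edge of the graph carries a weight that represents the similarity between the two sentences it connects.
  4. Run the PageRank algorithm on this weighted graph.
  5. Pick the highest-scoring vertices and append them to the summary.
  6. The number of vertices to pick is determined by the ratio or the word count.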
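
The steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not Gensim's actual implementation: the helper names (`split_sentences`, `tokenize`, `similarity`, `pagerank`, `textrank_summary`), the tiny stop-word list, and the word-overlap similarity are all chosen for this example, stemming is skipped, and Gensim's own similarity function may differ.

```python
import math
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "and", "to", "it", "for"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Step 1: lowercase, drop punctuation and stop words (no stemming here)."""
    return {w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOPWORDS}


def similarity(a, b):
    """Step 3: shared words, normalized by the log of both sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    return len(a & b) / denom if denom > 0 else 0.0


def pagerank(weights, damping=0.85, iterations=50):
    """Step 4: weighted PageRank by simple power iteration."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def textrank_summary(text, ratio=0.2):
    """Steps 2, 5 and 6: build the graph, rank it, keep the top sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = pagerank(weights)
    keep = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    # Emit the selected sentences in their original order for readability.
    return " ".join(sentences[i] for i in sorted(top))
```

Calling `textrank_summary(text, ratio=0.4)` on a five-sentence text keeps the two sentences that share the most vocabulary with the rest. Returning them in document order rather than score order is a design choice of this sketch.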

Code: Summarizing a Wikipedia article by (a) ratio and (b) word count.

Python
# Requires gensim < 4.0: the gensim.summarization module
# was removed in Gensim 4.0.
from gensim.summarization.summarizer import summarize
import wikipedia

# Fetch the content of the Wikipedia article.
wikisearch = wikipedia.page("Amitabh Bachchan")
wikicontent = wikisearch.content

# Save the wiki content to a file (for reference).
with open("wikicontent.txt", "w", encoding="utf-8") as f:
    f.write(wikicontent)

# Summary by ratio (5% of the original content).
summ_per = summarize(wikicontent, ratio=0.05)
print("Percent summary")
print(summ_per)

# Summary by word count (about 200 words).
summ_words = summarize(wikicontent, word_count=200)
print("Word count summary")
print(summ_words)


Output

Percent summary
Amitabh Bachchan (pronounced [əmɪˈtaːbʱ ˈbətːʃən]; born Inquilaab Srivastava;
11 October 1942) is an Indian film actor, film producer, television host, 
occasional playback singer and former politician. He first gained popularity
in the early 1970s for films such as Zanjeer, Deewaar and Sholay, and was
dubbed India's "angry young man" for his on-screen roles in Bollywood.
.
.
.
Apart from National Film Awards, Filmfare Awards and other competitive awards
which Bachchan won for his performances throughout the years, he has been 
awarded several honours for his achievements in the Indian film industry.
Word count summary
Beyond the Indian subcontinent, he also has a large overseas following 
in markets including Africa (such as South Africa), the Middle East 
(especially Egypt), United Kingdom, Russia and parts of the United 
States. Bachchan has won numerous accolades in his career, including 
four National Film Awards as Best Actor and many awards at 
international film festivals and award ceremonies.
.
.
.
After a three year stint in politics from 1984 to 1987, Bachchan 
returned to films in 1988, playing the title role in Shahenshah, 
which was a box office success.
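
Note that the `gensim.summarization` module used in the code above was removed in Gensim 4.0, so the import fails on newer installations. The following is a small sketch of how one might check for the module before running the example. The helper name `summarization_available` is hypothetical and introduced only for this illustration.

```python
import importlib.util


def summarization_available():
    """Return True if the installed Gensim still ships gensim.summarization."""
    try:
        return importlib.util.find_spec("gensim.summarization") is not None
    except ImportError:
        # Raised when the parent package 'gensim' is missing or fails to import.
        return False


if summarization_available():
    from gensim.summarization.summarizer import summarize
else:
    summarize = None
    print("gensim.summarization not found; a Gensim release below 4.0 is needed, "
          "for example: pip install 'gensim<4.0'")
```

Older Gensim releases may not install cleanly on recent Python versions, so a separate virtual environment is probably the safest place to try this.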