Newspaper Scraping Using Python and News API
There are two main ways to extract data from a website:
- Use the website's API, if one exists. For example, Facebook offers the Facebook Graph API, which allows data posted on Facebook to be retrieved.
- Access the HTML of the web page and extract the useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction (a minimal sketch follows this list).
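For contrast with the API-based approach used in this article, here is a minimal sketch of the second approach; the URL and the h2 tag are placeholder assumptions for illustration, not something this article prescribes:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (the URL is a placeholder)
html = requests.get('https://example.com/news').text
soup = BeautifulSoup(html, 'html.parser')

# Print every headline; the 'h2' tag is an assumed page structure
for headline in soup.find_all('h2'):
    print(headline.get_text(strip=True))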
In this article, we will use the News API. You can create your own API key on the News API website (newsapi.org).
Example: let's gauge how much attention newspapers pay to a public figure such as a head of state. Let's take Merkel as our example.
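One common precaution (our suggestion, not something the News API requires) is to keep the key out of the source code by reading it from an environment variable:

import os

# NEWS_API_KEY is a hypothetical variable name of our choosing
secret = os.environ.get('NEWS_API_KEY')
if secret is None:
    raise RuntimeError('Set the NEWS_API_KEY environment variable first')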
import pprint
import requests

secret = "YOUR_API_KEY"  # paste your own News API key here

# Define the endpoint
url = 'https://newsapi.org/v2/everything?'

# Specify the query and the
# number of results to return
parameters = {
    'q': 'merkel',     # query phrase
    'pageSize': 100,   # maximum is 100
    'apiKey': secret   # your own API key
}

# Make the request
response = requests.get(url, params=parameters)

# Convert the response to JSON format and pretty-print it
response_json = response.json()
pprint.pprint(response_json)
Output:
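Before working with the data, it is worth checking that the request actually succeeded. The 'status', 'totalResults' and 'message' fields used below are part of the JSON format documented by the News API:

# Stop early if the API reported an error (e.g. an invalid key)
if response_json.get('status') != 'ok':
    raise RuntimeError(response_json.get('message', 'unknown News API error'))

print(response_json['totalResults'], 'articles matched the query')
print(len(response_json['articles']), 'articles returned on this page')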
Let's combine all the article descriptions into one text, then sort the words by frequency from most to least common.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Concatenate all non-empty article descriptions
text_combined = ''
for i in response_json['articles']:
    if i['description'] is not None:
        text_combined += i['description'] + ' '

# Count how often each word occurs
wordcount = {}
for word in text_combined.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

# Print the words sorted by frequency, most common first
for k, v in sorted(wordcount.items(),
                   key=lambda item: item[1],
                   reverse=True):
    print(k, v)
Output:
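As an aside, the manual dictionary above works fine, but the standard library's collections.Counter does the same counting and sorting more compactly; a sketch:

from collections import Counter

# Count every word and list them from most to least frequent
wordcount = Counter(text_combined.split())
for word, count in wordcount.most_common():
    print(word, count)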
This output is noisy; if we remove common or useless words, the picture becomes much clearer. Let's define a list of such bad_words, as shown in the code below.
Now we can clean and format the text by removing the bad words:
import re

# Initialize the bad_words list (a mix of English, French and
# German stop words that appear in the articles)
bad_words = ["a", "the", "of", "in", "to", "and", "on", "de", "with",
             "by", "at", "dans", "ont", "été", "les", "des", "au", "et",
             "après", "avec", "qui", "par", "leurs", "ils", "a", "pour",
             "les", "on", "as", "france", "eux", "où", "son", "le", "la",
             "en", "with", "is", "has", "for", "that", "an", "but", "be",
             "are", "du", "it", "à", "had", "ist", "Der", "um", "zu", "den",
             "der", "-", "und", "für", "Die", "von", "als",
             "sich", "nicht", "nach", "auch"]

# Normalize whitespace and strip punctuation. Note that str.replace()
# does not understand regular expressions, so re.sub() is needed
# for the '\s+' pattern.
r = re.sub(r'\s+', ' ', text_combined).replace(',', ' ').replace('.', ' ')

# Keep only words that are not in bad_words and are longer than 3 characters
words = r.split()
rst = [word for word in words
       if word.lower() not in bad_words and len(word) > 3]
rst = ' '.join(rst)

# Count and print the remaining words by frequency
wordcount = {}
for word in rst.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

for k, v in sorted(wordcount.items(),
                   key=lambda item: item[1],
                   reverse=True):
    print(k, v)
Output:
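Instead of maintaining a hand-written bad_words list, one could use the stop-word lists that ship with NLTK. This is a sketch under the assumption that NLTK is installed (pip install nltk) and that covering English, French and German matches the language mix of the articles:

import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpora
from nltk.corpus import stopwords

# Merge the stop-word lists of the three languages seen in the results
stop_words = set(stopwords.words('english')
                 + stopwords.words('french')
                 + stopwords.words('german'))

filtered = ' '.join(word for word in text_combined.split()
                    if word.lower() not in stop_words and len(word) > 3)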
Let's plot the output as a word cloud:
# Generate and display a word cloud from the filtered text
word = WordCloud(max_font_size=40).generate(rst)
plt.figure()
plt.imshow(word, interpolation="bilinear")
plt.axis("off")
plt.show()
Output:
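If you want to keep the figure rather than just display it, WordCloud can write the image straight to disk (the filename here is an arbitrary choice):

# Save the word cloud as a PNG file
word.to_file('merkel_descriptions.png')
# Alternatively, save the matplotlib figure:
# plt.savefig('merkel_descriptions.png', dpi=150, bbox_inches='tight')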
As you can see from the article descriptions, the name most associated with Merkel is her defense minister, Kramp-Karrenbauer; "Kanzlerin" simply means the (female) Chancellor. We can do the same job using only the article titles.
# Concatenate all non-empty article titles
title_combined = ''
for i in response_json['articles']:
    if i['title'] is not None:
        title_combined += i['title'] + ' '

# Normalize whitespace and strip punctuation (re.sub again,
# since str.replace() does not handle regular expressions)
titles = re.sub(r'\s+', ' ', title_combined).replace(',', ' ').replace('.', ' ')

# Filter out bad words and very short words
words_t = titles.split()
result = [word for word in words_t
          if word.lower() not in bad_words and len(word) > 3]
result = ' '.join(result)

# Count the remaining words by frequency
wordcount = {}
for word in result.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

# Generate and display a word cloud from the filtered titles
word = WordCloud(max_font_size=40).generate(result)
plt.figure()
plt.imshow(word, interpolation="bilinear")
plt.axis("off")
plt.show()
Output:
From the titles we find that the figure most associated with Merkel is the Turkish president, Erdoğan.
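As a quick sanity check on both word clouds, one could count specific names directly in the combined texts (the spellings below are illustrative; some headlines may use 'Erdoğan' with the Turkish character instead):

# Compare how often each figure appears in descriptions vs. titles
for name in ['Erdogan', 'Kramp-Karrenbauer']:
    print(name,
          text_combined.count(name), 'in descriptions,',
          title_combined.count(name), 'in titles')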