Newspaper Scraping Using Python and News API
There are two main ways to extract data from a website:
- Use the website's API, if one exists. For example, Facebook offers the Facebook Graph API, which allows data posted on Facebook to be retrieved.
- Access the HTML of the web page and extract the useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction (a minimal sketch follows this list).
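For contrast with the API-based approach used in this article, here is a minimal sketch of the second approach; the URL and the h2 tag are placeholder assumptions for illustration, not something this article prescribes:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (the URL is a placeholder)
html = requests.get('https://example.com/news').text
soup = BeautifulSoup(html, 'html.parser')

# Print every headline; the 'h2' tag is an assumed page structure
for headline in soup.find_all('h2'):
    print(headline.get_text(strip=True))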
In this article, we will use the News API. You can create your own API key on the News API website (newsapi.org).
Example: let's gauge how much attention newspapers pay to a public figure such as a head of state. Let's take Merkel as our example.
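One common precaution (our suggestion, not something the News API requires) is to keep the key out of the source code by reading it from an environment variable:

import os

# NEWS_API_KEY is a hypothetical variable name of our choosing
secret = os.environ.get('NEWS_API_KEY')
if secret is None:
    raise RuntimeError('Set the NEWS_API_KEY environment variable first')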
import pprint
import requests

secret = "YOUR_API_KEY"  # paste your own News API key here

# Define the endpoint
url = 'https://newsapi.org/v2/everything?'

# Specify the query and the
# number of results to return
parameters = {
    'q': 'merkel',     # query phrase
    'pageSize': 100,   # maximum is 100
    'apiKey': secret   # your own API key
}

# Make the request
response = requests.get(url, params=parameters)

# Convert the response to JSON format and pretty-print it
response_json = response.json()
pprint.pprint(response_json)
Output:
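Before working with the data, it is worth checking that the request actually succeeded. The 'status', 'totalResults' and 'message' fields used below are part of the JSON format documented by the News API:

# Stop early if the API reported an error (e.g. an invalid key)
if response_json.get('status') != 'ok':
    raise RuntimeError(response_json.get('message', 'unknown News API error'))

print(response_json['totalResults'], 'articles matched the query')
print(len(response_json['articles']), 'articles returned on this page')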
Let's combine all the article descriptions into one text, then sort the words by frequency from most to least common.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Concatenate all non-empty article descriptions
text_combined = ''
for i in response_json['articles']:
    if i['description'] is not None:
        text_combined += i['description'] + ' '

# Count how often each word occurs
wordcount = {}
for word in text_combined.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

# Print the words sorted by frequency, most common first
for k, v in sorted(wordcount.items(),
                   key=lambda item: item[1],
                   reverse=True):
    print(k, v)
Output:
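As an aside, the manual dictionary above works fine, but the standard library's collections.Counter does the same counting and sorting more compactly; a sketch:

from collections import Counter

# Count every word and list them from most to least frequent
wordcount = Counter(text_combined.split())
for word, count in wordcount.most_common():
    print(word, count)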
This output is noisy; if we remove common or useless words, the picture becomes much clearer. Let's define a list of such bad_words, as shown in the code below.
Now we can clean and format the text by removing the bad words:
import re

# Initialize the bad_words list (a mix of English, French and
# German stop words that appear in the articles)
bad_words = ["a", "the", "of", "in", "to", "and", "on", "de", "with",
             "by", "at", "dans", "ont", "été", "les", "des", "au", "et",
             "après", "avec", "qui", "par", "leurs", "ils", "a", "pour",
             "les", "on", "as", "france", "eux", "où", "son", "le", "la",
             "en", "with", "is", "has", "for", "that", "an", "but", "be",
             "are", "du", "it", "à", "had", "ist", "Der", "um", "zu", "den",
             "der", "-", "und", "für", "Die", "von", "als",
             "sich", "nicht", "nach", "auch"]

# Normalize whitespace and strip punctuation. Note that str.replace()
# does not understand regular expressions, so re.sub() is needed
# for the '\s+' pattern.
r = re.sub(r'\s+', ' ', text_combined).replace(',', ' ').replace('.', ' ')

# Keep only words that are not in bad_words and are longer than 3 characters
words = r.split()
rst = [word for word in words
       if word.lower() not in bad_words and len(word) > 3]
rst = ' '.join(rst)

# Count and print the remaining words by frequency
wordcount = {}
for word in rst.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

for k, v in sorted(wordcount.items(),
                   key=lambda item: item[1],
                   reverse=True):
    print(k, v)
Output:
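Instead of maintaining a hand-written bad_words list, one could use the stop-word lists that ship with NLTK. This is a sketch under the assumption that NLTK is installed (pip install nltk) and that covering English, French and German matches the language mix of the articles:

import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpora
from nltk.corpus import stopwords

# Merge the stop-word lists of the three languages seen in the results
stop_words = set(stopwords.words('english')
                 + stopwords.words('french')
                 + stopwords.words('german'))

filtered = ' '.join(word for word in text_combined.split()
                    if word.lower() not in stop_words and len(word) > 3)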
Let's plot the output as a word cloud:
# Generate and display a word cloud from the filtered text
word = WordCloud(max_font_size=40).generate(rst)
plt.figure()
plt.imshow(word, interpolation="bilinear")
plt.axis("off")
plt.show()
Output:
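If you want to keep the figure rather than just display it, WordCloud can write the image straight to disk (the filename here is an arbitrary choice):

# Save the word cloud as a PNG file
word.to_file('merkel_descriptions.png')
# Alternatively, save the matplotlib figure:
# plt.savefig('merkel_descriptions.png', dpi=150, bbox_inches='tight')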
As you can see from the article descriptions, the name most associated with Merkel is her defense minister, Kramp-Karrenbauer; "Kanzlerin" simply means the (female) Chancellor. We can do the same job using only the article titles.
# Concatenate all non-empty article titles
title_combined = ''
for i in response_json['articles']:
    if i['title'] is not None:
        title_combined += i['title'] + ' '

# Normalize whitespace and strip punctuation (re.sub again,
# since str.replace() does not handle regular expressions)
titles = re.sub(r'\s+', ' ', title_combined).replace(',', ' ').replace('.', ' ')

# Filter out bad words and very short words
words_t = titles.split()
result = [word for word in words_t
          if word.lower() not in bad_words and len(word) > 3]
result = ' '.join(result)

# Count the remaining words by frequency
wordcount = {}
for word in result.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

# Generate and display a word cloud from the filtered titles
word = WordCloud(max_font_size=40).generate(result)
plt.figure()
plt.imshow(word, interpolation="bilinear")
plt.axis("off")
plt.show()
Output:
From the titles we find that the figure most associated with Merkel is the Turkish president, Erdoğan.
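As a quick sanity check on both word clouds, one could count specific names directly in the combined texts (the spellings below are illustrative; some headlines may use 'Erdoğan' with the Turkish character instead):

# Compare how often each figure appears in descriptions vs. titles
for name in ['Erdogan', 'Kramp-Karrenbauer']:
    print(name,
          text_combined.count(name), 'in descriptions,',
          title_combined.count(name), 'in titles')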