Scraping Websites with Newspaper3k in Python
Web scraping is a powerful tool for collecting information from websites. To scrape multiple URLs, we can use a Python library called Newspaper3k. The Newspaper3k package is a Python library for scraping articles from the web; it is built on top of requests and uses lxml for parsing. This module is an improved version of the Newspaper module and serves the same purpose.
Installation:
To install this module, type the following command in the terminal.
pip install newspaper3k
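If you want to confirm the package was installed, pip's show command lists the installed version and its dependencies:
pip show newspaper3k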
Step-by-step approach:
- First, we will define a list containing the URLs, or assign a single URL.
- We will create an Article object, passing in arguments such as the URL and optional parameters such as language='en' for English.
- Then we will download and parse the article.
- Finally, display the extracted data.
Below are some examples based on the above approach.
Example 1
Below is a program that scrapes data from a given URL.
Python3
# Import required module
import newspaper
# Assign url
url = 'https://www.geeksforgeeks.org/top-5-open-source-online-machine-learning-environments/'
# Extract web data
url_i = newspaper.Article(url, language='en')
url_i.download()
url_i.parse()
# Display scraped data
print(url_i.text)
Output:
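Besides the full text, the parsed Article object also exposes metadata fields such as the title, authors, publish date, and top image. Below is a minimal sketch (not part of the original examples) that reuses the URL from Example 1 and prints a few of these fields:
Python3
# Import required module
import newspaper

# Assign url
url = 'https://www.geeksforgeeks.org/top-5-open-source-online-machine-learning-environments/'

# Download and parse the article
article = newspaper.Article(url, language='en')
article.download()
article.parse()

# Display metadata extracted alongside the text
print(article.title)         # headline of the article
print(article.authors)       # list of detected authors (may be empty)
print(article.publish_date)  # datetime, or None if not detected
print(article.top_image)     # URL of the lead image, if any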
Example 2
Here, we scrape data from multiple URLs and then display their contents.
Python3
# Import required modules
import newspaper
# Define list of urls
list_of_urls = ['https://www.geeksforgeeks.org/how-to-get-the-magnitude-of-a-vector-in-numpy/',
                'https://www.geeksforgeeks.org/3d-wireframe-plotting-in-python-using-matplotlib/',
                'https://www.geeksforgeeks.org/difference-between-small-data-and-big-data/']
# Parse through each url and display its content
for url in list_of_urls:
    url_i = newspaper.Article(url, language='en')
    url_i.download()
    url_i.parse()
    print(url_i.text)
Output:
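When looping over many URLs, a single failed download (network error, removed page) raises an exception and stops the whole run. Below is a minimal sketch of the same loop with an added try/except so that failing URLs are simply skipped; the error handling is an illustrative addition, not part of the original example:
Python3
# Import required modules
import newspaper

# Define list of urls
list_of_urls = ['https://www.geeksforgeeks.org/how-to-get-the-magnitude-of-a-vector-in-numpy/',
                'https://www.geeksforgeeks.org/3d-wireframe-plotting-in-python-using-matplotlib/',
                'https://www.geeksforgeeks.org/difference-between-small-data-and-big-data/']

# Parse each url and display its content, skipping any that fail
for url in list_of_urls:
    try:
        url_i = newspaper.Article(url, language='en')
        url_i.download()
        url_i.parse()
        print(url_i.text)
    except Exception as e:
        # Skip URLs that could not be downloaded or parsed
        print(f"Skipping {url}: {e}")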