📜  Web Scraping in Python Using lxml

📅  Last modified: 2022-05-13 01:55:22.956000             🧑  Author: Mango

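Web scraping means extracting specific pieces of information from one or more websites. Every website has a recognizable structure or pattern of HTML elements, and that structure is what lets a script locate the data it needs.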

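To do this, a few third-party packages need to be installed. Use pip to install them; pip fetches prebuilt wheel (.whl) files when they are available.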

pip install requests
pip install lxml
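
To confirm that both packages are visible to the interpreter you plan to use, a quick check like the following can help. This is only a sketch: the helper name `installed_version` is hypothetical, and it relies on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of the two packages this article relies on.
for name in ("requests", "lxml"):
    found = installed_version(name)
    print(name, found if found else "is not installed")
```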

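You also need the XPath of the element the data will be scraped from. A simple way to get it:

1. Right-click the element on the page that you want to scrape and choose "Inspect".

2. Right-click the highlighted element in the source panel on the right.

3. Copy the XPath.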
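
To see how a copied XPath maps onto a page, it can help to try one against a small inline document before touching a live site. The sketch below assumes lxml is installed; the HTML fragment and its `post-1` / `post-2` ids are made up for illustration and do not come from any real page.

```python
from lxml import html

# Hypothetical page fragment standing in for a real website.
SAMPLE = """
<html><body>
  <div id="post-1">
    <div><p>First article summary.</p></div>
  </div>
  <div id="post-2">
    <div><p>Second article summary.</p></div>
  </div>
</body></html>
"""

# Parse the markup into an element tree.
doc = html.fromstring(SAMPLE)

# An XPath of the kind the browser's "Copy XPath" option produces:
# any element with id "post-2", then its div child, then that div's p child.
matches = doc.xpath('//*[@id="post-2"]/div/p')

# xpath() returns a list of matching elements.
print(len(matches))
print(matches[0].text_content())
```

The same path style is used in the examples in this article, only with ids taken from the live pages.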

Here is a simple implementation that runs against the GeeksforGeeks homepage:

Python3
# Python 3 code implementing web scraping using lxml

import requests

# import only the html module
from lxml import html

# URL to scrape data from
url = 'https://www.geeksforgeeks.org'

# XPath to the target element
path = '//*[@id="post-183376"]/div/p'

# get the response object
response = requests.get(url)

# get the page as a byte string
byte_data = response.content

# parse the bytes into an element tree
source_code = html.fromstring(byte_data)

# select the elements matching the XPath (returns a list)
tree = source_code.xpath(path)

# print the text of the first match, if there is one
if tree:
    print(tree[0].text_content())
else:
    print('No element matched the XPath')


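The GeeksforGeeks example above scrapes the paragraph from the first article on the GeeksforGeeks homepage.
Here is sample output. Your output may differ, because the articles on the homepage change over time.

Output: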

"Consider the following C/C++ programs and try to guess the output?
Output of all of the above programs is unpredictable (or undefined).
The compilers (implementing… Read More »"
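
Because the homepage changes, a hard-coded post id may stop matching. In that case `xpath()` returns an empty list, and indexing it with `[0]` raises an `IndexError`. The following is a minimal sketch of one way to guard against that. The helper name `extract_first_text` and the inline sample page are hypothetical, introduced only for illustration.

```python
from lxml import html


def extract_first_text(page_bytes, path):
    """Return the stripped text of the first element matching path, or None."""
    tree = html.fromstring(page_bytes)
    matches = tree.xpath(path)
    if not matches:
        # Nothing matched: the page layout or the element id has probably changed.
        return None
    return matches[0].text_content().strip()


# Hypothetical page bytes standing in for response.content.
page = b'<html><body><div id="post-1"><div><p> Hello </p></div></div></body></html>'

print(extract_first_text(page, '//*[@id="post-1"]/div/p'))   # Hello
print(extract_first_text(page, '//*[@id="missing"]/div/p'))  # None
```

With a live page, the bytes from `response.content` would be passed in place of `page`.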

Here is another example, which scrapes data from the Wikipedia article on web scraping.

Python3

import requests
from lxml import html

# URL to scrape data from
link = 'https://en.wikipedia.org/wiki/Web_scraping'

# XPath to the target element: the first <p> child of the content div
path = '//*[@id="mw-content-text"]/div/p[1]'

# fetch the page and get it as a byte string
response = requests.get(link)
byte_string = response.content

# parse the bytes into an element tree
source_code = html.fromstring(byte_string)

# select the elements matching the XPath (returns a list)
tree = source_code.xpath(path)

# print the text of the first match, if there is one
if tree:
    print(tree[0].text_content())
else:
    print('No element matched the XPath')

Output: