Web Scraping in Python Using lxml and XPath

Last modified: 2021-05-20 08:16:02 | Author: Mango

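Prerequisite: Introduction to Web Scraping

In this article, we discuss the lxml Python library for scraping data from web pages. lxml is built on top of libxml2, an XML parsing library written in C. Compared with other Python web scraping tools such as BeautifulSoup and Selenium, the lxml package has an advantage in performance: it reads and writes large XML files quickly, which makes data processing easier and faster.

We will use the lxml library for web scraping and the requests library for making HTTP requests in Python. Both can be installed from the command line with pip, the Python package installer.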
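As a quick check after installation (for example with pip install lxml requests), the following minimal sketch imports both packages and reports their versions. The helper name installed_versions is only illustrative and is not part of either library.

```python
# Sanity check: both third-party packages import and report a version.
from lxml import etree
import requests


def installed_versions():
    """Return the lxml and requests versions as printable strings."""
    # etree.LXML_VERSION is a tuple of integers, joined here with dots.
    lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
    return {"lxml": lxml_version, "requests": requests.__version__}


print(installed_versions())
```

If either import fails, the package is not installed in the interpreter that runs the script.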

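Getting data from elements on a web page with lxml requires XPath expressions.

Using XPath

XPath works much like the paths of a traditional file system.

Figure: file system diagram

To access File1,

C:/File1

Similarly, to access File2,

C:/Documents/User1/File2

Now consider a simple web page,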

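HTML

<html>
   <head>
       <title>My page</title>
   </head>
   <body>
       <h2>Welcome to my page</h2>
       <a href="www.example.com">page</a>
       <p>This is the first paragraph</p>
       <h2>Hello World</h2>
   </body>
</html>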

This page can be represented as an XML tree, as follows:

Figure: XML tree of the web page

To get the text contained in the <p> tag,

XPath: /html/body/p/text()

Result: This is the first paragraph

To get the value of an attribute of the anchor (<a>) tag,

XPath: /html/body/a/@href

Result: www.example.com

To get the text inside the second <h2> tag,

XPath: /html/body/h2[2]/text()

Result: Hello World
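The three queries above can be tried directly with lxml. The sketch below inlines the sample page so that no network access is needed. The queries are written with a leading /, which anchors them at the document root. The variable names are illustrative.

```python
from lxml import html

# The sample page from this section, inlined so the example runs offline.
SAMPLE_PAGE = b"""
<html>
   <head>
       <title>My page</title>
   </head>
   <body>
       <h2>Welcome to my page</h2>
       <a href="www.example.com">page</a>
       <p>This is the first paragraph</p>
       <h2>Hello World</h2>
   </body>
</html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# xpath() returns a list, even when only one node matches.
paragraph_text = tree.xpath('/html/body/p/text()')
link_target = tree.xpath('/html/body/a/@href')
second_heading = tree.xpath('/html/body/h2[2]/text()')

print(paragraph_text)
print(link_target)
print(second_heading)
```

Because every result is a list, a single value is usually taken with an index such as paragraph_text[0], after checking that the list is not empty.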

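To find the XPath of a specific element on a page:

Right-click the element on the page and click Inspect.
Right-click the highlighted element in the Elements tab.
Click Copy XPath.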
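The browser steps above have a rough programmatic counterpart in lxml: the getpath() method of the element tree returns an absolute XPath for a given element. A minimal sketch, using invented inline markup instead of a live site:

```python
from lxml import html

# Invented markup for illustration only.
PAGE = b"""<html><body>
<h2>Welcome to my page</h2>
<p>This is the first paragraph</p>
<h2>Hello World</h2>
</body></html>"""

root = html.fromstring(PAGE)
document = root.getroottree()

# Ask lxml for the absolute XPath of every child element of <body>.
paths = [document.getpath(element) for element in root.body]
print(paths)

# Each generated path selects exactly the element it was generated from.
round_trips = [document.xpath(document.getpath(element)) == [element]
               for element in root.body]
print(round_trips)
```

A path copied from the browser describes the page after the browser has parsed it and run its scripts, so it may not always match the HTML that requests downloads. Checking the copied path against the parsed tree, as in the round trip above, is a cheap safeguard.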

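Using LXML

Step-by-step approach

We use requests.get to retrieve the web page that contains our data.
We use html.fromstring to parse the response content with the lxml parser.
We write the appropriate XPath query and use lxml's xpath function to get the required elements.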
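The three steps can be sketched as two small functions. The names fetch_page and extract and the timeout value are assumptions made for this illustration, not part of lxml or requests. The demonstration at the bottom runs steps 2 and 3 on an inline document so that it works without network access.

```python
from lxml import html
import requests


def fetch_page(url, timeout=10):
    """Step 1: download the page and return its raw bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content


def extract(content, query):
    """Steps 2 and 3: parse the content and run an XPath query on it."""
    tree = html.fromstring(content)
    return tree.xpath(query)


if __name__ == "__main__":
    # Offline demonstration of steps 2 and 3 on an inline document.
    sample = b"<html><body><p>first</p><p>second</p></body></html>"
    print(extract(sample, "//p/text()"))
```

With network access, the two functions combine as extract(fetch_page(url), query). Keeping the download separate from the parsing makes the XPath part easy to test on saved or inline HTML.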

Example 1:

Below is a program based on the above approach, using a specific URL.

Python

# Import required modules
from lxml import html
import requests
  
# Request the page
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
  
# Parsing the page
# (page.content is passed rather than page.text so that
# lxml receives the raw bytes and can detect the
# document encoding itself.)
tree = html.fromstring(page.content)
  
# Get element using XPath
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)

Output:
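The list that is printed depends on what the page contains when the request is made. To see how the query itself behaves without network access, here is a minimal sketch on inline markup. The markup and the names in it are invented for illustration and are not taken from the real page.

```python
from lxml import html

# Invented markup; it only mimics the shape that the query expects.
SAMPLE = b"""
<html><body>
  <div title="buyer-name">Alice</div>
  <span class="item-price">$10.00</span>
  <section>
    <div title="buyer-name">Bob</div>
    <div title="seller-name">Carol</div>
  </section>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# '//' matches <div> elements at any depth of the tree,
# [@title="buyer-name"] keeps only those with that title attribute,
# and text() returns the text directly inside each match.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)
```

Unlike the absolute paths shown earlier, a query starting with // does not depend on where the matching elements sit in the tree, which is why the nested div is found as well.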

Example 2:

Another example, using the URL of an e-commerce site.

Python

# Import required modules
from lxml import html
import requests
  
# Request the page
page = requests.get('https://webscraper.io/test-sites/e-commerce/allinone')
  
# Parsing the page
tree = html.fromstring(page.content)
  
# Get element using XPath
prices = tree.xpath(
    '//div[@class="col-sm-4 col-lg-4 col-md-4"]/div/div[1]/h4[1]/text()')
print(prices)

Output:
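The printed list depends on the live page. Note that [@class="..."] compares the whole attribute string, so the query stops matching if the classes are reordered or another class is added. A common XPath 1.0 idiom matches a single class token instead. The sketch below uses invented markup and a hypothetical helper named has_class; neither comes from the site or from lxml.

```python
from lxml import html

# Invented markup and values, for illustration only.
SAMPLE = b"""
<html><body>
  <div class="col-sm-4 col-lg-4 col-md-4"><h4>10.00</h4></div>
  <div class="col-md-4 featured"><h4>20.00</h4></div>
  <div class="col-md-40"><h4>30.00</h4></div>
</body></html>
"""


def has_class(name):
    """Build an XPath predicate that matches one whitespace-separated class."""
    return f'contains(concat(" ", normalize-space(@class), " "), " {name} ")'


tree = html.fromstring(SAMPLE)

# Exact comparison: only the div whose class string is identical matches.
exact = tree.xpath('//div[@class="col-sm-4 col-lg-4 col-md-4"]/h4/text()')

# Token comparison: any div that carries the col-md-4 class matches,
# while col-md-40 is rejected because of the padded spaces.
by_token = tree.xpath(f'//div[{has_class("col-md-4")}]/h4/text()')

print(exact)
print(by_token)
```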
