Python | Extracting URLs from HTML using lxml

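Link extraction is a very common task when parsing HTML. For any general-purpose web crawler, it is one of the most important functions. Among the available Python libraries, lxml is one of the best to work with. As this article shows, lxml provides several helper functions for extracting links.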

lxml installation -

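lxml is a Python binding for the C libraries libxml2 and libxslt. It therefore combines a Pythonic API with very fast HTML and XML parsing. For it to work, the C libraries must also be available on the system (the prebuilt packages installed by the commands below normally take care of this). For installation instructions, see the lxml installation documentation.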

Installation commands -

sudo apt-get install python3-lxml
or
pip install lxml
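
A quick way to confirm that the installation worked is to import lxml and print the versions of the C libraries it is bound to. This is only a minimal sanity check; the version numbers printed will differ from machine to machine.

```python
from lxml import etree

# Version of lxml itself, as a tuple of integers.
print("lxml    :", etree.LXML_VERSION)

# Versions of the underlying C libraries that lxml is using.
print("libxml2 :", etree.LIBXML_VERSION)
print("libxslt :", etree.LIBXSLT_VERSION)
```

If the import fails, the package (or the C libraries it depends on) is not installed correctly.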

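What is lxml?
lxml is designed for parsing XML and HTML, and it ships with an html module for the latter. An HTML string can be parsed easily with the fromstring() function, which returns the root element of the parsed document. Calling iterlinks() on that element then yields all the links it contains.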

The iterlinks() method yields one tuple per link, with four items -

element : The parsed node in which the link was found, for example the anchor tag. If only the link itself is of interest, this can be ignored.
attr : The attribute the link came from, which for an anchor tag is simply 'href'.
link : The actual URL extracted from the tag.
pos : The numeric index of the link within the attribute value or text in which it was found (0 for an ordinary href).
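
The tuple layout described above can be sketched with a small document that has links in more than one kind of attribute. The HTML snippet and the URLs in it are made up for illustration.

```python
from lxml import html

# Hypothetical sample document with links in different attributes.
doc = html.fromstring(
    '<div>'
    '<a href="/world">geeks</a>'
    '<img src="/logo.png">'
    '<a href="https://example.com/about">about</a>'
    '</div>'
)

# Each item yielded by iterlinks() is (element, attribute, link, pos).
for element, attribute, link, pos in doc.iterlinks():
    print(element.tag, attribute, link, pos)

# Every URL in the document, regardless of where it came from.
all_urls = [link for _, _, link, _ in doc.iterlinks()]

# Only the URLs taken from the href attribute of anchor tags.
anchor_urls = [
    link
    for element, attribute, link, _ in doc.iterlinks()
    if element.tag == "a" and attribute == "href"
]
```

Filtering on the element tag and attribute name is one simple way to separate hyperlinks from images and other resources.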


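Code #1:

# importing library
from lxml import html

# parse an HTML string that contains a single anchor tag
string_document = html.fromstring('hi <a href="/world">geeks</a>')

# collect every link found in the document
link = list(string_document.iterlinks())

# number of links found
print("Length of the link : ", len(link))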

Output:

Length of the link :  1

Code #2: Retrieving the iterlinks() tuple

# unpack the first (and only) tuple returned by iterlinks()
(element, attribute, link, pos) = link[0]

print("attribute : ", attribute)
print("\nlink : ", link)
print("\nposition : ", pos)

Output:

attribute :  href

link :  /world

position :  0

How it works -

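An ElementTree is built when lxml parses the HTML. An ElementTree is a tree structure with parent and child nodes. Each node in the tree represents an HTML tag and holds all the attributes of that tag. Once created, the tree can be iterated to find elements, such as anchor or link tags. The lxml.html module contains only the HTML-specific functions for creating and iterating a tree, while the lxml.etree module contains the core tree-handling code.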
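
The parent and child relationships described above can be sketched as follows. The page content is a made-up example; only the lxml calls (iter(), get(), getparent()) are the point.

```python
from lxml import html

# Hypothetical sample page used only for illustration.
page = html.fromstring(
    "<html><head><title>Demo</title></head>"
    "<body><p>Read <a href='/a'>A</a> and <a href='/b'>B</a>.</p></body></html>"
)

# The root node is <html>; iterating over it yields its direct children.
child_tags = [child.tag for child in page]

# iter('a') walks the whole tree and yields every anchor element.
hrefs = [anchor.get("href") for anchor in page.iter("a")]

# Each node knows its parent, so the tree can be walked upward as well.
first_anchor = next(page.iter("a"))
parent_tag = first_anchor.getparent().tag

print(child_tags, hrefs, parent_tag)
```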

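Parsing HTML from a file -

Instead of using fromstring() to parse HTML, the parse() function can be called with a filename or URL, for example html.parse('http://the/url') or html.parse('/path/to/filename'). The parsed content is the same as if the URL or file had been loaded into a string and passed to fromstring(). Note that parse() returns an ElementTree object, so call getroot() on it to get the root element.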
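
A minimal sketch of parsing from a file, assuming a small HTML file written to a temporary location so that the example is self-contained. The file content is hypothetical.

```python
import os
import tempfile

from lxml import html

# Write a small hypothetical HTML file to disk.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as handle:
    handle.write('<html><body><a href="/world">geeks</a></body></html>')
    path = handle.name

try:
    tree = html.parse(path)  # returns an ElementTree
    root = tree.getroot()    # the <html> root element
    file_links = [link for _, _, link, _ in root.iterlinks()]
finally:
    os.remove(path)          # clean up the temporary file

print(file_links)
```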

Code #3: Working with the ElementTree

import requests
import lxml.html
  
# requesting url
web_response = requests.get('https://www.geeksforgeeks.org/')
  
# building the element tree from the response body
element_tree = lxml.html.fromstring(web_response.text)
  
tree_title_element = element_tree.xpath('//title')[0]
  
print("Tag title : ", tree_title_element.tag)
print("\nText title :", tree_title_element.text_content())
print("\nhtml title :", lxml.html.tostring(tree_title_element))
print("\ntitle tag:", tree_title_element.tag)
print("\nParent's tag title:", tree_title_element.getparent().tag)

Output:

Tag title :  title

Text title : GeeksforGeeks | A computer science portal for geeks

html title : b'<title>GeeksforGeeks | A computer science portal for geeks</title>\r\n'

title tag: title

Parent's tag title: head

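Scraping using requests -

requests is a Python library for making HTTP requests, which makes it useful for scraping websites. Its get() method takes a URL as an argument, requests that URL from the web server, and returns a Response object. This object holds details about both the request and the response. To read the content of the web page, use the response.text attribute (it is an attribute, not a method). This content is what the web server sent back in reply to the request.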

Code #4: Requesting the web server

import requests
  
web_response = requests.get('https://www.geeksforgeeks.org/')
print("Response from web server : \n", web_response.text)

Output:
The response is a very long HTML document, so only a sample is shown here.

Response from web server : 
...
...
...
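
Putting the two parts together, the sketch below fetches a page with requests and extracts its links with lxml. The helper names extract_links and fetch_links, the 10 second timeout, and the example.com URL are assumptions made for this illustration and are not part of either library. The parsing step is kept separate so that it can be tried without network access.

```python
from lxml import html


def extract_links(page_source, base_url):
    """Return absolute URLs for every link found in page_source."""
    document = html.fromstring(page_source)
    # Resolve relative links such as '/world' against the page URL.
    document.make_links_absolute(base_url)
    return [link for _, _, link, _ in document.iterlinks()]


def fetch_links(url):
    """Download url with requests and return the links it contains."""
    import requests  # imported here so extract_links works without requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return extract_links(response.text, url)


# Offline check of the parsing step with a hypothetical snippet.
sample_links = extract_links('hi <a href="/world">geeks</a>', "https://example.com/")
print(sample_links)
```

Calling fetch_links('https://www.geeksforgeeks.org/') would apply the same steps to a live page; the result depends on the page content at the time of the request.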