Python Urllib Module

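The urllib package is Python's URL handling module. It is used to fetch URLs (Uniform Resource Locators). It does this through the urlopen function, which can fetch URLs over a variety of protocols.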

urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs
urllib.parse for parsing URLs
urllib.error for the exceptions raised
urllib.robotparser for parsing robots.txt files

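urllib is part of the Python 3 standard library, so it is available in every standard Python installation and does not need to be installed with pip.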

Let us look at each of these in detail.

urllib.request

This module defines functions and classes for opening URLs (mostly HTTP). One of the simplest ways to open such a URL is:
urllib.request.urlopen(url)
We can see this in an example:

import urllib.request
request_url = urllib.request.urlopen('https://www.geeksforgeeks.org/')
print(request_url.read())

Output: the source code of the page at the URL, i.e. GeeksforGeeks, printed as a bytes object.
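As a minimal sketch of the same call in a slightly more defensive form, the block below wraps the URL in a Request object (one of the classes this module provides), opens it in a with statement so the response is closed automatically, and decodes the bytes to text. The data: URL and the 'example-client/1.0' User-Agent string are assumptions chosen so the sketch runs without a network connection; substitute a real http(s) URL in practice.

```python
import urllib.request

# Assumption: a data: URL keeps this sketch runnable offline.
url = 'data:text/plain;charset=utf-8,Hello%20urllib'

# Request bundles a URL with optional headers; the header value is illustrative.
request = urllib.request.Request(url, headers={'User-Agent': 'example-client/1.0'})

# The with statement closes the response when the block ends.
with urllib.request.urlopen(request) as response:
    body = response.read()        # raw bytes
    text = body.decode('utf-8')   # decoded to str

print(request.get_method())  # GET
print(text)                  # Hello urllib
```

read() always returns bytes, which is why the earlier example prints a b'...' literal; decoding is a separate step.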

urllib.parse

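This module defines functions for manipulating URLs and their component parts, either to build them or to break them apart. It mostly focuses on splitting a URL into small components, or joining different URL components into a URL string.
We can see this in the code below: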

from urllib.parse import urlparse, urlunparse

# split the URL into its six components
parse_url = urlparse('https://www.geeksforgeeks.org/python-langtons-ant/')
print(parse_url)
print("\n")
unparse_url = urlunparse(parse_url)
print(unparse_url)

ParseResult(scheme='https', netloc='www.geeksforgeeks.org', path='/python-langtons-ant/', params='', query='', fragment='')

https://www.geeksforgeeks.org/python-langtons-ant/

Note: the different components of the URL are separated and then joined again. Try other URLs to get a better understanding.
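As a further sketch, the components of a parsed URL can be read by name, and a modified URL can be rebuilt from them. The example.com URL is an assumption for illustration, and parse_qs and urlencode are additional urllib.parse helpers not covered in the text above.

```python
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

# Hypothetical URL used only for illustration.
parts = urlparse('https://www.example.com/search?q=urllib&page=2#results')

print(parts.scheme)    # https
print(parts.netloc)    # www.example.com
print(parts.path)      # /search
print(parts.fragment)  # results

# parse_qs turns the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)           # {'q': ['urllib'], 'page': ['2']}

# urlencode builds a query string; spaces become '+'.
new_query = urlencode({'q': 'python urllib', 'page': 3})

# ParseResult is a named tuple, so _replace returns a modified copy.
new_url = urlunparse(parts._replace(query=new_query, fragment=''))
print(new_url)         # https://www.example.com/search?q=python+urllib&page=3
```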

Some of the functions in urllib.parse are:

Function	Use
urllib.parse.urlparse	Separates different components of URL
urllib.parse.urlunparse	Join different components of URL
urllib.parse.urlsplit	It is similar to urlparse() but doesn’t split the params
urllib.parse.urlunsplit	Combines the tuple element returned by urlsplit() to form URL
urllib.parse.urldefrag	If the URL contains a fragment, returns the URL with the fragment removed, together with the fragment as a separate string.
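The differences between these functions are easier to see side by side. The sketch below uses a made-up example.com URL (an assumption) that contains path parameters, a query, and a fragment.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit, urldefrag

# Hypothetical URL with params (;type=a), a query, and a fragment.
url = 'https://www.example.com/docs/page;type=a?lang=en#intro'

parsed = urlparse(url)
split = urlsplit(url)

# urlparse separates the params from the path; urlsplit leaves them in the path.
print(parsed.path, parsed.params)  # /docs/page type=a
print(split.path)                  # /docs/page;type=a

# urlunsplit reverses urlsplit.
rebuilt = urlunsplit(split)
print(rebuilt == url)              # True

# urldefrag returns the URL without the fragment, plus the fragment itself.
defragged = urldefrag(url)
print(defragged.url)       # https://www.example.com/docs/page;type=a?lang=en
print(defragged.fragment)  # intro
```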

urllib.error
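This module defines the exception classes raised by urllib.request. Whenever an error occurs while fetching a URL, one of these exceptions is raised. The exceptions are: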

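URLError – raised for errors in the URL, or for errors that occur while fetching the URL because of connectivity problems. It has a 'reason' attribute that tells the user the cause of the error.
HTTPError – raised for exotic HTTP errors, such as authentication request errors. It is a subclass of URLError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).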

We can see this in the following examples:

# URL Error
  
import urllib.request
import urllib.error

# trying to read the URL but with no internet connectivity
try:
    x = urllib.request.urlopen('https://www.google.com')
    print(x.read())

# Catching the URLError raised when the connection fails
except urllib.error.URLError as e:
    print(str(e))

URL Error: urlopen error [Errno 11001] getaddrinfo failed

# HTTP Error
  
import urllib.request
import urllib.error

# trying to read the URL
try:
    x = urllib.request.urlopen('https://www.google.com/search?q=test')
    print(x.read())

# Catching the HTTPError raised when the server rejects the request
except urllib.error.HTTPError as e:
    print(str(e))

HTTP Error 403: Forbidden
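When both kinds of failure need to be handled in one place, the except clause for HTTPError has to come before the one for URLError, because HTTPError is a subclass of URLError and would otherwise never be reached. The following is a minimal sketch of that pattern; the helper name try_fetch and the file: URL pointing at a non-existent path are assumptions, chosen so that a URLError is raised without any network access.

```python
import urllib.error
import urllib.request


def try_fetch(url):
    """Hypothetical helper: return (kind, detail) describing how the fetch went."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return ('ok', len(response.read()))
    except urllib.error.HTTPError as e:
        # Checked first: the server answered, but with an error status code.
        return ('http-error', e.code)
    except urllib.error.URLError as e:
        # Everything else: bad URL, missing file, no connection, and so on.
        return ('url-error', e.reason)


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# Assumed path that does not exist, so opening it fails locally.
kind, detail = try_fetch('file:///no-such-directory/no-such-file.txt')
print(kind)    # url-error
print(detail)  # the underlying OS error
```

The code attribute of HTTPError holds the numeric status, while the reason attribute of URLError holds the underlying cause.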

urllib.robotparser
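This module contains a single class, RobotFileParser. This class answers questions about whether or not a particular user agent can fetch a URL on the website that published the robots.txt file. robots.txt is a text file that webmasters create to instruct web robots how to crawl pages on their website. The robots.txt file tells web crawlers which parts of the server should not be accessed.
For example: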

# importing robot parser class
import urllib.robotparser as rb
  
bot = rb.RobotFileParser()
  
# sets the URL where the website's robots.txt file resides
x = bot.set_url('https://www.geeksforgeeks.org/robots.txt')
print(x)
  
# reads the files
y = bot.read()
print(y)
  
# we can crawl the main site
z = bot.can_fetch('*', 'https://www.geeksforgeeks.org/')
print(z)
  
# but can not crawl the disallowed url
w = bot.can_fetch('*', 'https://www.geeksforgeeks.org/wp-admin/')
print(w)

None
None
True
False
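The result of the example above depends on the live robots.txt file of the site. As a self-contained sketch of the same checks, the rules can also be fed to the parser directly with its parse() method, which takes a list of lines. The rules and the example.com URLs below are assumptions made up for illustration, not the contents of any real site's file.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied in memory instead of downloaded.
rules = """\
User-agent: *
Disallow: /wp-admin/
Allow: /
""".splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)  # parse() accepts an iterable of lines

# The main page matches only the Allow rule.
allowed_home = parser.can_fetch('*', 'https://www.example.com/')
print(allowed_home)   # True

# The admin path matches the Disallow rule first.
allowed_admin = parser.can_fetch('*', 'https://www.example.com/wp-admin/')
print(allowed_admin)  # False
```

Rules are checked in the order they appear, so the more specific Disallow line is placed before the general Allow line.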