📜  BeautifulSoup – Error Handling

📅  Last Modified: 2022-05-13 01:54:23.910000             🧑  Author: Mango

Sometimes, while scraping data from websites, we all run into several types of errors, some of which are hard to understand and some of which are basic syntax errors. Here we will discuss the types of exceptions encountered while writing scraping scripts.

Errors while fetching a website

Whenever we fetch the content of a website, we need to understand the errors that can occur during the fetch. These errors may be an HTTPError, a URLError, an AttributeError, or an XMLParserError. We will now discuss each error one by one.
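With the requests library, the network-level errors among these all derive from a common base class, requests.exceptions.RequestException, so a single except clause can cover them when fine-grained handling is not needed. A small offline sketch of the hierarchy:

```python
import requests

# HTTPError, ConnectionError and Timeout all inherit from
# RequestException, so one except clause can catch any of them.
for exc in (requests.exceptions.HTTPError,
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout):
    print(exc.__name__, issubclass(exc, requests.exceptions.RequestException))
```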

HTTPError:

An HTTPError occurs when we perform a web-scraping operation against a page that does not exist or is unavailable on the server. If we supply a wrong link in the request to the server, executing the program always shows a "Page Not Found" style error on the terminal.
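It is raise_for_status() that converts a 4xx/5xx status code into an HTTPError. As an illustration that needs no network access, we can build a bare Response object by hand (purely a demonstration device, not how requests is normally used) and watch the exception fire:

```python
import requests

# Build a bare Response manually so no network is needed; normally
# requests.get() would return a fully populated one.
response = requests.models.Response()
response.status_code = 404

try:
    # 404 is a client error, so this raises requests.exceptions.HTTPError
    response.raise_for_status()
except requests.exceptions.HTTPError as hp:
    print(hp)
```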

Example:

Python3
# importing modules
import requests
from requests.exceptions import HTTPError
 
url = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
 
try:
    response = requests.get(url)
    # raises HTTPError for 4xx/5xx status codes
    response.raise_for_status()
except HTTPError as hp:
    print(hp)
else:
    print("it's worked")


Output:

The URL link we supplied works correctly, so no error occurs. Now we change the link to see the HTTPError:

Python

# importing modules
import requests
from requests.exceptions import HTTPError
 
url = 'https://www.geeksforgeeks.org/page-that-do-not-exist'
 
try:
    response = requests.get(url)
    # the page does not exist, so this raises a "404 Client Error"
    response.raise_for_status()
except HTTPError as hp:
    print(hp)
else:
    print("it's worked")

Output:

URLError:

A URLError occurs when we request a wrong website from the server, meaning the URL we are requesting is itself wrong (typically the domain cannot be resolved). A URLError always responds with a "server could not be found" style error. Note that urllib raises urllib.error.URLError for this, while the requests library raises its own equivalent, requests.exceptions.ConnectionError.
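A minimal sketch of this failure mode under requests, using the reserved .invalid top-level domain so the hostname is guaranteed never to resolve:

```python
import requests

# ".invalid" is reserved (RFC 2606) and can never resolve, so the
# request fails before any HTTP exchange takes place.
try:
    requests.get('http://nonexistent.invalid/', timeout=5)
except requests.exceptions.ConnectionError:
    print("The Server Could Not be Found")
```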



Example:

Python3

# importing modules
import requests
from requests.exceptions import ConnectionError
 
# correctly spelled domain: the server is found and the request succeeds
url = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
 
try:
    response = requests.get(url)
    response.raise_for_status()
except ConnectionError as ce:
    print("The Server Could Not be Found")
else:
    print("No Error")

Output:

Here we see the program executes correctly and prints the output "No Error". Now we change the URL link so that the error is shown:

Python3

# importing modules
import requests
from requests.exceptions import ConnectionError
 
# note the misspelled domain: "geeksforgeks" instead of "geeksforgeeks"
url = 'https://www.geeksforgeks.org/implementing-web-scraping-python-beautiful-soup/'
 
try:
    response = requests.get(url)
    response.raise_for_status()
except ConnectionError as ce:
    print("The Server Could Not be Found")
else:
    print("No Error")

Output:

AttributeError:

An AttributeError in BeautifulSoup is raised when an invalid attribute reference is made or an attribute assignment fails. It occurs when, during execution, we access an attribute that the object in question does not actually have. In particular, when we try to access a tag through BeautifulSoup and that tag is not present on the website, the lookup returns None, and accessing anything further on that result always raises an AttributeError.

Let us take a simple example to explain the concept of AttributeError in web scraping with BeautifulSoup:



Python3

# importing modules
import requests
import bs4
 
url = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
 
# getting response from server
response = requests.get(url)
 
# extracting html
soup = bs4.BeautifulSoup(response.text, 'html.parser')
 
# soup.NoneExistingTag is None because no such tag exists, so
# accessing .SomeTag on it raises an AttributeError
print(soup.NoneExistingTag.SomeTag)

Output:
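The same AttributeError can be reproduced without touching the network by parsing a small HTML string (the tag names below are made up for the demonstration), which also shows the safer pattern of checking find() results explicitly:

```python
import bs4

html = "<html><body><p>hello</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# Dotted access on a missing tag returns None, so chaining on it
# raises an AttributeError:
try:
    print(soup.noneexistingtag.sometag)
except AttributeError as ae:
    print("AttributeError:", ae)

# find() also returns None for a missing tag, which we can test for
# explicitly instead of letting the script crash:
tag = soup.find("noneexistingtag")
if tag is None:
    print("tag not found")
```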

XMLParserError:

We have all run into XML parser errors while writing web-scraping scripts. With the help of BeautifulSoup we can easily parse a document as HTML, and if we get stuck on a parser error we can usually overcome it simply by choosing a suitable parser, which BeautifulSoup makes very easy.

When we parse XML content from a website, we usually pass 'xml' or 'lxml-xml' as the parser argument of the BeautifulSoup constructor. It is written as the second argument, after the document itself.

An XML parser error usually shows up when the element we pass to the find() or find_all() function is missing from the document (or when we pass nothing at all). Rather than raising an exception, these calls return empty brackets [] (for find_all()) or None (for find()) as their output.

Python

import requests
import bs4
 
url = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
response = requests.get(url)
 
# the 'xml' parser requires the lxml package to be installed
soup = bs4.BeautifulSoup(response.text, 'xml')
 
# no such div exists, so find() returns None instead of raising
print(soup.find('div', class_='that not present in html content'))

Output:
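The contrast between the two return values can be seen offline with a small HTML snippet (the class names are invented for the example):

```python
import bs4

html = "<html><body><div class='content'>text</div></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

# A missing selector gives None from find() and [] from find_all():
print(soup.find("div", class_="missing"))
print(soup.find_all("div", class_="missing"))

# An existing selector returns the tag itself:
print(soup.find("div", class_="content").text)
```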