Encoding in BeautifulSoup

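Character encoding plays an important role in interpreting the content of HTML and XML documents. A document may contain not only English characters but also non-English ones, such as Hebrew, Latin, or Greek. So that the parser knows which encoding to use, the document carries a dedicated tag and attribute that specify it. For example: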

In an HTML document:

<meta charset="--encoding method name--">

In an XML document:

<?xml version="1.0" encoding="--encoding method name--"?>

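These tags tell the browser which encoding to use when parsing. If the correct encoding is not specified, the content is rendered incorrectly, sometimes with the replacement character "�".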
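As a small illustration of the point above, the following sketch uses only Python's built-in codecs (not bs4) to show what happens when bytes are decoded with the wrong encoding. The sample word and the variable names are chosen purely for illustration.

```python
# Text with one non-ASCII character, stored as UTF-8 bytes.
data = "caf\u00e9".encode("utf-8")

# Decoding with the correct encoding recovers the text.
right = data.decode("utf-8")

# Decoding with a wrong single-byte encoding silently yields wrong characters.
wrong = data.decode("iso-8859-1")

# Decoding as ASCII with errors="replace" inserts U+FFFD for each undecodable byte.
replaced = data.decode("ascii", errors="replace")

print(right)
print(wrong)
print(replaced)
```

The second result is the typical garbled text seen in browsers, and the third shows where the "�" character comes from.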

XML encoding methods

An XML document can be encoded in one of the formats listed below.

UTF-8
UTF-16
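Latin-1
US-ASCII
ISO-8859-1 to ISO-8859-10
Of these, UTF-8 is the most common. UTF-16 uses two bytes for most characters (four for characters outside the Basic Multilingual Plane), and documents with '0xx' are encoded with this method. Latin-1 covers Western European characters.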
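The size differences between these encodings can be checked with Python's built-in codecs. This is a minimal sketch; the sample characters are arbitrary, and the "utf-16-le" codec is used so that no byte order mark is included in the count.

```python
# ASCII letter, accented Latin letter (U+00E9), Hebrew alef (U+05D0)
chars = ["A", "\u00e9", "\u05d0"]

utf8_sizes = [len(c.encode("utf-8")) for c in chars]
utf16_sizes = [len(c.encode("utf-16-le")) for c in chars]

print(utf8_sizes)   # [1, 2, 2]
print(utf16_sizes)  # [2, 2, 2]

# Latin-1 stores each character it supports in a single byte ...
latin1_size = len("\u00e9".encode("latin-1"))
print(latin1_size)  # 1

# ... but it has no code for Hebrew letters.
try:
    "\u05d0".encode("latin-1")
    hebrew_fits = True
except UnicodeEncodeError:
    hebrew_fits = False
print(hebrew_fits)  # False
```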
HTML encoding methods
HTML and HTML5 documents can be encoded with any of the following methods.
UTF-8
UTF-16
ISO-8859-1
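UTF-16BE (big-endian)
UTF-16LE (little-endian)
WINDOWS-874
WINDOWS-1250 to WINDOWS-1258
For HTML5 documents, UTF-8 is the recommended encoding. ISO-8859-1 is used mostly with XHTML documents. Some encodings, such as UTF-7, UTF-32, BOCU-1, and CESU-8, are explicitly listed as ones not to use, because they replace most characters with the replacement character "�".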
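The big-endian and little-endian variants differ only in the order of the two bytes in each code unit. A short sketch with Python's built-in codecs, using an arbitrary sample character:

```python
import codecs

ch = "A"  # U+0041

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

print(big)     # b'\x00A'
print(little)  # b'A\x00'

# Plain "utf-16" prepends a byte order mark so a reader can tell the variants apart.
with_bom = ch.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(has_bom)  # True
```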
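BeautifulSoup and encoding
The BeautifulSoup module, usually imported as bs4, is a boon that makes HTML/XML parsing easy. It offers many methods: some select content by tag name or by the attributes present in a tag, some extract content based on the document hierarchy, some print the content with proper HTML indentation, and so on. The bs4 module automatically detects the encoding used in a document and converts the content to Unicode. The returned BeautifulSoup object has various attributes that give more information. Sometimes, however, it guesses the encoding incorrectly, so if the encoding is known it is better to pass it as an argument. This article covers the ways of specifying the encoding in bs4.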
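original_encoding
The bs4 module includes a sub-library named Unicode, Dammit, which detects the encoding and uses it to convert the document to Unicode. The original_encoding attribute returns the encoding that was detected.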
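Unicode, Dammit can also be used on its own, without a full parse. The sketch below follows the usage pattern shown in the Beautiful Soup documentation; the sample bytes and the list of candidate encodings are illustrative assumptions, not part of this article's examples.

```python
from bs4.dammit import UnicodeDammit

# Bytes that are not valid UTF-8; 0xE9 is "e with acute accent" in Latin-1.
raw = b"Sacr\xe9 bleu!"

# The second argument lists encodings that should be tried first.
dammit = UnicodeDammit(raw, ["latin-1", "iso-8859-1"])

# The converted text and the encoding that made the conversion succeed.
print(dammit.unicode_markup)
print(dammit.original_encoding)
```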
Example 1:
Given an HTML element, parse it and find the encoding that was used.
Python3
from bs4 import BeautifulSoup

# HTML element with content (a bytes literal)
h1 = b"<h1>Hello world!!</h1>"

# parse with the built-in html.parser
parsed = BeautifulSoup(h1, "html.parser")

# name of the tag that was found
print("Tag found :", parsed.h1.name)

# content inside the tag
print("Content :", parsed.h1.string)

# encoding detected by bs4
print("Encoding method :", parsed.original_encoding)

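Here the HTML element string is prefixed with "b", which means it is treated as a bytes literal. The parser therefore detects and reports ASCII as the encoding. In real documents, the original encoding is normally the one declared in the HTML document itself.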
Example 2:
Given a URL, parse the content and find the original encoding.
Python3
from bs4 import BeautifulSoup
import requests

URL = 'https://www.geeksforgeeks.org/python-update-nested-dictionary/'

# request the page from the server
page = requests.get(URL)

# parse the contents of the page
soup = BeautifulSoup(page.content, "html.parser")

# encoding detected by bs4
print("Encoded method :", soup.original_encoding)
Output
Encoded method : utf-8
Verifying the output:
Python3
from bs4 import BeautifulSoup

# `page` is the response fetched in the previous snippet
soup = BeautifulSoup(page.content, "html.parser")

# fetch the charset attribute of the <meta> tag in that content
tag = soup.meta['charset']

print("Encoding method :", tag)
Output

Encoding method : UTF-8
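from_encoding
This is a parameter that can be passed to the BeautifulSoup() constructor. It tells the bs4 module explicitly which encoding must be used. This saves time and avoids incorrect parsing caused by a wrong guess.
Example:
Python3
from bs4 import BeautifulSoup

# HTML element
markup = b"<h1>\xa2\xf6`\xe0</h1>"

# parsing content (no parser and no encoding specified)
soup = BeautifulSoup(markup)

print("Content :", soup.h1.string)
print("Encoding method :", soup.original_encoding)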
This produces the following warning and error:
/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
BeautifulSoup([your markup])
to this:
BeautifulSoup([your markup], "html5lib")
markup_type=markup_type))

Traceback (most recent call last):
File "/home/98e5f50281480cda5f5e31e3bcafb085.py", line 9, in <module>
print("Content :",soup.h1.string)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
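The editor on GeeksforGeeks tried to print the parsed text with the ASCII codec and ended with an error. Running the same code on a local machine does produce output, but the content actually corresponds to "ISO-8859-8", and the characters bs4 guesses are not the intended ones. Stating the known encoding explicitly gives the correct output.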
Python3
from bs4 import BeautifulSoup

# HTML element
markup = b"<h1>\xa2\xf6`\xe0</h1>"

# parsing content with the known encoding stated explicitly
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

print("Content :", soup.h1.string)
print("Encoding method :", soup.original_encoding)
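To see why the guess matters, the same four bytes can be decoded with Python's built-in codecs under two candidate encodings. This is only a sketch: windows-1252 is picked here as an example of a plausible wrong guess, and the variable names are illustrative.

```python
raw = b"\xa2\xf6`\xe0"

# Intended interpretation: ISO-8859-8 (Latin/Hebrew).
as_hebrew = raw.decode("iso-8859-8")

# A Western European interpretation of exactly the same bytes.
as_western = raw.decode("windows-1252")

print(as_hebrew)
print(as_western)

# Same bytes, different characters: only the first and third positions agree.
print([a == b for a, b in zip(as_hebrew, as_western)])
```

Both decodings succeed without any error, which is why a wrong guess can go unnoticed unless the encoding is stated.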

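Output encoding
When the parsed HTML content has to be written out, the bs4 module by default produces a UTF-8 encoded document, or sometimes one based on a wrongly guessed encoding. To encode the document with another method without passing it to the constructor, the following can be used:
prettify(): This method prints the HTML content with proper indentation. The encoding to use can be passed as an argument, so that the encoding is changed when the document is printed.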
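Example:
Python3
# import module
from bs4 import BeautifulSoup

# HTML document whose bytes are ISO-8859-8 encoded
markup = b'''<html><meta charset="iso-8859-8"/><body><h1>\xa2\xf6`\xe0</h1></body></html>'''

# parsing content
soup = BeautifulSoup(markup, "html.parser")

print(soup.prettify())
In the output, the <meta> tag shows the encoding set to UTF-8. To prevent this, pass the encoding to prettify() as follows.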
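Python3
from bs4 import BeautifulSoup

# HTML document whose bytes are ISO-8859-8 encoded
markup = b'''<html><meta charset="iso-8859-8"/><body><h1>\xa2\xf6`\xe0</h1></body></html>'''

# parsing content
soup = BeautifulSoup(markup, "html.parser")

print(soup.prettify("iso-8859-8"))
Output:
b'<html>\n <meta charset="iso-8859-8"/>\n <body>\n  <h1>\n   \xa2\xf6`\xe0\n  </h1>\n </body>\n</html>'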
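One detail worth noting is that prettify() returns a string when no encoding is given and a bytes object when one is passed. The sketch below checks this; the sample markup is a hypothetical document, and from_encoding is set so that the result does not depend on encoding detection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the heading bytes are ISO-8859-8 encoded.
markup = b'<html><meta charset="iso-8859-8"/><body><h1>\xa2\xf6`\xe0</h1></body></html>'
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")

as_text = soup.prettify()               # no encoding given: a str
as_bytes = soup.prettify("iso-8859-8")  # encoding given: bytes

print(type(as_text).__name__)
print(type(as_bytes).__name__)

# The encoded output contains the original heading bytes again.
print(b"\xa2\xf6`\xe0" in as_bytes)
```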
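encode(): The desired encoding can be passed explicitly to the encode() method. Characters that cannot be represented in that encoding are replaced with the corresponding XML character references.
Example:
Python3
from bs4 import BeautifulSoup

# HTML element
markup = b"<h1>\xa2\xf6`\xe0</h1>"

# parsing content; the encoding is left for bs4 to guess
soup = BeautifulSoup(markup, "html.parser")

print("Content :", soup.h1.string)
print("Encoding method :", soup.original_encoding)

# html.parser adds no <html> wrapper, so encode the <h1> tag itself
print("After explicit encoding :", soup.h1.encode("iso-8859-8"))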
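The replacement with XML character references is easiest to see with a character that the target encoding cannot represent at all. The following sketch mirrors an example from the Beautiful Soup documentation; the snowman character is just a convenient sample.

```python
from bs4 import BeautifulSoup

# A tag whose text is a single non-ASCII character (U+2603).
markup = "<b>\N{SNOWMAN}</b>"
tag = BeautifulSoup(markup, "html.parser").b

# UTF-8 can represent the character directly.
utf8_bytes = tag.encode("utf-8")

# ASCII cannot, so a numeric character reference is written instead.
ascii_bytes = tag.encode("ascii")

print(utf8_bytes)   # b'<b>\xe2\x98\x83</b>'
print(ascii_bytes)  # b'<b>&#9731;</b>'
```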