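Parsing an HTML Document and Converting It to XML with Python

In this article, we will see how to parse an HTML document with Python and convert it to XML format.

This can be done in the following ways:

Using the lxml module.
Using the BeautifulSoup module.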

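Method 1: Using the Python lxml library

In this method, we use Python's lxml library to parse the HTML document and write it out as an encoded string representation of its XML tree. The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible with but superior to the well-known ElementTree API.

Installation: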

pip install lxml

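We provide the path to the HTML document, open it, and read and parse its contents with the html.fromstring(str) function, which returns a single element or document tree. This function parses a document from the given string. When the string is a complete document, the result is a correct HTML document, which means the root node is <html>, and there is a body and possibly a head.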

htmldoc = html.fromstring(inp.read())
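
As a small illustration of how the shape of the input affects the result, the sketch below parses a full document and a bare fragment. The sample strings are made up for this example; html.document_fromstring() is the lxml function that always builds a complete document around its input.

```python
from lxml import html

# Sample inputs invented for this illustration.
full_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
fragment = "<p>Hello</p>"

# A complete document comes back rooted at <html>.
doc_root = html.fromstring(full_doc)

# A single-element fragment comes back as that element itself.
frag_root = html.fromstring(fragment)

# document_fromstring() always builds a full document around the input.
wrapped_root = html.document_fromstring(fragment)

print(doc_root.tag, frag_root.tag, wrapped_root.tag)
```

For the file-based program in this method, the input is a whole HTML file, so the parsed result is expected to be rooted at <html>.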

Next, use the etree.tostring() function to serialize the parsed HTML element/document tree to an encoded string representation of its XML tree.

out.write(etree.tostring(htmldoc))

HTML file used: input.html

Code:

Python3

# Import the required library
from lxml import html, etree
  
# Main Function
if __name__ == '__main__':
  
    # Provide the path of the html file
    file = "input.html"
  
    # Open the html file and Parse it, 
    # returning a single element/document.
    with open(file, 'r', encoding='utf-8') as inp:
        htmldoc = html.fromstring(inp.read())
  
    # Open an output.xml file and write the 
    # element/document to an encoded string 
    # representation of its XML tree.
    with open("output.xml", 'wb') as out:
        out.write(etree.tostring(htmldoc))
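
The same conversion can be tried in memory, without files. The sketch below is a hedged variation and not part of the original program: the sample markup and the to_xml_bytes helper are hypothetical, while pretty_print, xml_declaration, and encoding are existing keyword arguments of etree.tostring().

```python
from lxml import html, etree


def to_xml_bytes(html_text):
    """Hypothetical helper: parse HTML text and serialize it as XML bytes."""
    root = html.fromstring(html_text)
    return etree.tostring(
        root,
        pretty_print=True,     # indent the output for readability
        xml_declaration=True,  # emit an <?xml ...?> header
        encoding="utf-8",      # return UTF-8 encoded bytes
    )


if __name__ == "__main__":
    sample = "<html><body><p>Line one<br>Line two</p></body></html>"
    xml_bytes = to_xml_bytes(sample)
    # Void HTML elements such as <br> are serialized in self-closed XML form.
    print(xml_bytes.decode("utf-8"))
```

Because the result is bytes, it can be written to a file opened in 'wb' mode, as the program above does.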

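Method 2: Using BeautifulSoup

In this method, we use the BeautifulSoup module with html.parser to parse the raw HTML document, then modify the parsed document and write it to an XML file. Provide the path to the HTML file, open and read it, and parse it with BeautifulSoup's html.parser, which returns an object representing the parsed document.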

BeautifulSoup(inp, 'html.parser')

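To remove the HTML doctype, first get the string representation of the document with soup.prettify(), then split it into lines with splitlines(), which returns a list of lines. When the document starts with a doctype, it is the first line of that list.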

soup.prettify().splitlines()
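
Dropping the first line works only when the doctype really is the first line of the prettified output. As a hedged alternative, not the approach used in this article's program, the sketch below removes Doctype nodes from the parsed tree before serializing. The strip_doctype helper and the sample markup are hypothetical; Doctype and extract() come from bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype


def strip_doctype(html_text):
    """Hypothetical helper: return prettified markup without any doctype."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Copy the list first, because extract() mutates soup.contents.
    for node in list(soup.contents):
        if isinstance(node, Doctype):
            node.extract()
    return soup.prettify()


if __name__ == "__main__":
    sample = "<!DOCTYPE html><html><body><p>Hello</p></body></html>"
    print(strip_doctype(sample))
```

With this variant, a document that has no doctype keeps its first line instead of losing it.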

Code:

Python3

# Import the required library
from bs4 import BeautifulSoup
  
# Main Function
if __name__ == '__main__':
  
    # Provide the path of the html file
    file = "input.html"
  
    # Open the html file and Parse it 
    # using Beautiful soup's html.parser.
    with open(file, 'r', encoding='utf-8') as inp:
        soup = BeautifulSoup(inp, 'html.parser')
      
    # Split the document by lines and join the lines
    # from index 1 to remove the doctype Html as it is 
    # present in index 0 from the parsed document.
    lines = soup.prettify().splitlines()
    content = "\n".join(lines[1:])
  
    # Open an output.xml file and write the modified content.
    with open("output.xml", 'w', encoding='utf-8') as out:
        out.write(content)

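
One way to sanity-check the result of either method is to confirm that the text parses as XML. The sketch below is an addition to the article and uses only the standard library; is_well_formed_xml is a hypothetical helper name, and in practice its argument would be the contents of output.xml.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Hypothetical helper: True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Self-closed <br/> is valid XML; a bare <br> leaves an unclosed tag.
    print(is_well_formed_xml("<html><body><p>a<br/>b</p></body></html>"))
    print(is_well_formed_xml("<html><body><p>a<br>b</p></body></html>"))
```

A check like this catches cases where the HTML-to-XML step left markup that an XML parser would reject.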