Converting an XML Structure to a DataFrame with BeautifulSoup - Python

📅 Last modified: 2023-12-03 15:36:26.684000 🧑 Author: Mango

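BeautifulSoup makes it straightforward to parse XML in Python and load the extracted values into a pandas DataFrame, where they can then be cleaned and analyzed.

Installation and imports

Install beautifulsoup4 and pandas before use. The "xml" parser used below is provided by lxml, so install it as well. The leading ! is for notebook environments such as Jupyter; drop it in a regular shell:

!pip install beautifulsoup4 pandas lxml

Import the libraries in Python as follows: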

from bs4 import BeautifulSoup
import pandas as pd

Loading the XML file

First, read the XML file and hand its contents to BeautifulSoup for parsing.

with open("example.xml", "rb") as f:
    data = f.read()

soup = BeautifulSoup(data, "xml")

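Here example.xml is a sample data file containing the XML to convert. It is opened in binary mode ("rb") so that the parser can detect the file's encoding itself.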
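The article does not show the contents of example.xml. As a rough sketch, the file might look like the string below; the element names (records, record, id, title, description) are inferred from the extraction code in this article, and the values are invented for illustration. Parsing from a string instead of a file makes the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data: the element names mirror those used in this
# article, but the values are made up for illustration.
SAMPLE_XML = """
<records>
  <record>
    <id>1</id>
    <title>First record</title>
    <description>A record with every field present.</description>
  </record>
  <record>
    <id>2</id>
    <description>A record without a title element.</description>
  </record>
</records>
"""

# The "xml" parser is provided by lxml, so lxml must be installed.
soup = BeautifulSoup(SAMPLE_XML, "xml")

# Collect every <record> element in the document.
record_tags = soup.find_all("record")
print(len(record_tags))

# find() returns None when the requested child element is absent.
first_title = record_tags[0].find("title")
second_title = record_tags[1].find("title")
print(first_title.get_text(), second_title)
```

The second record deliberately omits its title, which is the case the None checks in the extraction code are meant to handle.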

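Extracting the data and converting it to a DataFrame

Next, extract the data with BeautifulSoup. Elements can be located by tag name or by attribute, and the extracted values are then collected into a DataFrame.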

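records = []

# find_all is the current name; findAll is a deprecated alias
for record in soup.find_all("record"):
    # record.id looks up the first <id> element inside this record
    # and evaluates to None when no such element exists
    record_id = record.id.get_text() if record.id else None
    record_title = record.title.get_text() if record.title else None
    record_description = record.description.get_text() if record.description else None

    records.append((record_id, record_title, record_description))

df = pd.DataFrame(records, columns=["Id", "Title", "Description"])

Here records is a list holding one tuple per record. The loop uses find_all to locate every record tag, reads the text of its id, title, and description child elements (these are nested elements, not XML attributes), and appends the values to records. A missing child element yields None. The list is then converted to a DataFrame with columns set to ["Id", "Title", "Description"], which become the column names.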
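The same extraction can be wrapped in a reusable function. The following is a minimal sketch, not the article's own code: the function name xml_records_to_dataframe, its parameters, and the sample data are all hypothetical. It differs from the loop above in two ways: it only looks at direct children of each record (recursive=False), and it strips surrounding whitespace from the text.

```python
from bs4 import BeautifulSoup
import pandas as pd


def xml_records_to_dataframe(xml_text, record_tag="record",
                             fields=("id", "title", "description")):
    """Build a DataFrame with one row per <record_tag> element.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    soup = BeautifulSoup(xml_text, "xml")  # requires lxml
    rows = []
    for record in soup.find_all(record_tag):
        row = {}
        for field in fields:
            # recursive=False limits the search to direct children of the record.
            child = record.find(field, recursive=False)
            # A missing child element becomes None in the resulting row.
            row[field] = child.get_text(strip=True) if child is not None else None
        rows.append(row)
    # Passing columns keeps the column order stable even if rows is empty.
    return pd.DataFrame(rows, columns=list(fields))


# Invented sample data; the second record has no <title> element.
sample = """
<records>
  <record><id>1</id><title>First</title><description>Complete.</description></record>
  <record><id>2</id><description>No title here.</description></record>
</records>
"""

df = xml_records_to_dataframe(sample)
print(df)
```

Building each row as a dict keyed by field name keeps values and column names together, so adding a field only requires changing the fields argument.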

Displaying the result

Finally, call the head method to show the first few rows of the resulting DataFrame.

print(df.head())

The output looks like this:

   Id                     Title                                        Description
0   1  Beautiful Soup, a Python                                  I live in New York.
1   2                      None  The Dormouse's story features a talking mouse...
2   3                      None  The Caterpillar and Alice looked at each other.
3   4                      None           And thus continued the work on the data.
4   5                      None                    I'm a graduate student in CS.

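Summary

BeautifulSoup handles parsing the XML and extracting its elements, and the extracted values convert to a DataFrame in a few lines of code. From there, pandas provides the usual DataFrame operations, such as data cleaning and aggregation, for the actual analysis.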
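As a brief illustration of the cleaning step, the sketch below works on a small hand-built DataFrame shaped like the one produced in this article. The rows and the "(untitled)" placeholder are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows shaped like the DataFrame built in this article.
df = pd.DataFrame(
    [
        ("1", "First record", "Short text."),
        ("2", None, "A longer description."),
        ("3", None, None),
    ],
    columns=["Id", "Title", "Description"],
)

# Text extracted from XML is always a string, so convert Id to numbers.
df["Id"] = pd.to_numeric(df["Id"])

# Cleaning: drop rows without a description, then fill in missing titles.
cleaned = df.dropna(subset=["Description"]).copy()
cleaned["Title"] = cleaned["Title"].fillna("(untitled)")

# Simple aggregate: how many records lack a title in the original data.
missing_titles = int(df["Title"].isna().sum())

print(cleaned)
print(missing_titles)
```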