beautifulsoup import (1)

Last modified: 2023-12-03 15:29:36.452000 | Author: Mango

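Introduction to Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML documents. It pairs a simple but powerful API with a choice of parsers, so you can quickly pull the data you need out of a page's HTML or XML source.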

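Installing Beautiful Soup

You can install Beautiful Soup into your Python environment with pip. The examples in this article also use the requests library to download pages; it is a separate package (pip install requests).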

pip install beautifulsoup4
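
The package is installed under the name beautifulsoup4 but imported as bs4, which is easy to mix up. A minimal check, assuming the install command above succeeded, imports the module and parses a tiny inline document with the built-in parser (the sample markup is made up for illustration):

```python
# The PyPI package is named beautifulsoup4, but the importable module is bs4.
import bs4
from bs4 import BeautifulSoup

# Print the installed version to confirm that the import works.
print(bs4.__version__)

# Parse a small inline document as a smoke test; no network access is needed.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.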

Example code

from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.prettify())

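Beautiful Soup parsers

Beautiful Soup relies on a parser to turn an HTML document into a tree that you can search. Python's built-in html.parser works without any extra installation; third-party parsers such as lxml and html5lib can be used as well.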

from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')  # use the 'lxml' parser (requires: pip install lxml)
soup = BeautifulSoup(response.text, 'html5lib')  # use the 'html5lib' parser (requires: pip install html5lib)
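
Parsers differ mainly in how they repair invalid markup. The sketch below feeds the same deliberately broken fragment to each parser and skips any that are not installed. The fragment and the helper name parse_with are made up for illustration; FeatureNotFound is the exception Beautiful Soup raises when the requested parser library is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: the <p> and <b> tags are never closed.
broken_html = "<p>first<b>bold"

def parse_with(parser_name):
    """Return the repaired markup as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(broken_html, parser_name))
    except FeatureNotFound:
        # The parser library (for example lxml or html5lib) is not installed.
        return None

# Try the built-in parser and the two optional third-party parsers.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}

for name, markup in results.items():
    print(name, "->", markup if markup is not None else "not installed")
```

When they are installed, lxml and html5lib typically wrap the fragment in html and body elements, while html.parser leaves it as a bare fragment, so the choice of parser can change what find and find_all return on messy pages.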

Beautiful Soup features

1. Extracting tags

The find method returns the first matching tag in an HTML document, and find_all returns every match.

from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Get the first tag that matches the tag name
a_tag = soup.find('a')

# Get every tag that matches the tag name
a_tags = soup.find_all('a')
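
Both methods also accept filters beyond the tag name. The following sketch works on an inline HTML string (made up for illustration) instead of a downloaded page, so it runs without network access:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a class="external" href="https://example.com/a">A</a></li>
  <li><a href="/b">B</a></li>
  <li><a class="external" href="https://example.com/c">C</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("table")

# find_all returns a list-like result; class_ filters on the CSS class.
external_links = soup.find_all("a", class_="external")

# limit caps the number of results that are returned.
first_two = soup.find_all("a", limit=2)

print(first_link.get_text())
print(missing)
print([a.get_text() for a in external_links])
print(len(first_two))
```

Because find returns None when there is no match, it is worth checking the result before accessing attributes on it.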

2. Extracting attributes

A tag's attributes can be read with dictionary-style access.

from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Read the href attribute of the first a tag
a_tag = soup.find('a')
href_value = a_tag['href']
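
Dictionary-style access raises KeyError when the attribute is missing, which is common on real pages. A small sketch of the safer alternatives, using an inline HTML string made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<a class="btn primary" href="/home">Home</a><a>No link</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# get() returns None instead of raising when an attribute is absent.
hrefs = [a.get("href") for a in links]

# attrs exposes all attributes of a tag as a dictionary.
first_attrs = links[0].attrs

# class is a multi-valued attribute, so its value comes back as a list.
classes = links[0]["class"]

# Dictionary-style access on a missing attribute raises KeyError.
try:
    links[1]["href"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(hrefs)
print(classes)
print(missing_raises)
```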

3. Traversing the document tree

The descendants attribute iterates over every node in the document tree.

from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Iterate over every descendant node (tags and text), not only direct children
for child in soup.descendants:
    print(child)
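
Printing every descendant of a full page produces a lot of output, because text nodes are included alongside tags. The sketch below, on an inline fragment made up for illustration, contrasts children with descendants and separates tags from text:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# children yields only the direct children of a node.
child_names = [child.name for child in div.children if isinstance(child, Tag)]

# descendants walks the whole subtree in document order.
tag_names = [node.name for node in div.descendants if isinstance(node, Tag)]

# Text nodes are NavigableString objects rather than Tag objects.
texts = [str(node) for node in div.descendants if isinstance(node, NavigableString)]

print(child_names)
print(tag_names)
print(texts)
```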

4. Modifying the document

Tags, attributes, and strings in the document tree can be modified.

from bs4 import BeautifulSoup
import requests

url = 'https://www.example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Replace the text of the first matching tag
title_tag = soup.find('title')
title_tag.string = 'New Title'

# Add a new tag
new_tag = soup.new_tag('p')
new_tag.string = 'This is a new paragraph.'
soup.body.append(new_tag)

print(soup.prettify())
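
The example above changes a string and appends a tag. Attributes and tag names can be changed in a similar way, and a tag can be removed entirely. A minimal sketch on an inline document (the markup and attribute values are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><h1 class="old">Title</h1><span>remove me</span></body></html>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")

# Attributes are assigned like dictionary entries.
heading["class"] = "headline"   # overwrite an existing attribute
heading["id"] = "main-title"    # add a new attribute

# Renaming a tag only requires assigning to its name.
heading.name = "h2"

# decompose removes a tag and its contents from the tree.
soup.find("span").decompose()

print(soup)
```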

To sum up, Beautiful Soup is powerful yet easy to use, and it makes extracting data from HTML and XML documents much simpler.