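How to remove empty tags using BeautifulSoup in Python?
Prerequisites: Requests, BeautifulSoup, strip
The task is to write a program that removes empty tags from HTML code. Beautiful Soup has no built-in method for removing tags that have no content, so the check has to be written by hand.
Modules needed:
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, type the following command in the terminal.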
pip install bs4
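- requests: Requests lets you send HTTP/1.1 requests very easily. It is not part of the Python standard library either. To install it, type the following command in the terminal.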
pip install requests
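Approach:
- Get the HTML code.
- Iterate over every tag.
- Fetch the text from the tag and remove the surrounding whitespace with strip.
- If the length of the stripped text is zero, remove the tag from the HTML code.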
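The steps above can be sketched as one small function. This is a minimal sketch, not the article's own code: the name remove_empty_tags is hypothetical, and it assumes Python's built-in "html.parser" so that it runs without any extra parser package.

```python
from bs4 import BeautifulSoup


def remove_empty_tags(html, parser="html.parser"):
    """Return html with every tag that has no non-whitespace text removed."""
    # Step 1: get the HTML code into a parse tree
    soup = BeautifulSoup(html, parser)
    # Step 2: iterate over every tag; find_all() returns a list,
    # so removing tags inside the loop does not disturb the iteration
    for tag in soup.find_all():
        # Step 3: fetch the text with surrounding whitespace stripped
        # Step 4: a length of zero means the tag is empty, so remove it
        if len(tag.get_text(strip=True)) == 0:
            tag.extract()
    return str(soup)


cleaned = remove_empty_tags("<div><p></p><b>kept</b><span>  </span></div>")
print(cleaned)  # <div><b>kept</b></div>
```

Wrapping the loop in a function keeps the check in one place, so the same logic can be applied to an HTML string or to a downloaded page.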
Example 1: Remove empty tags.
Python3
# Import module
from bs4 import BeautifulSoup

# HTML object: illustrative markup with an empty <p>, two <br> tags
# and a whitespace-only <span>
html_object = "<div><p></p><strong>some<br>text<br>here</strong><span> </span></div>"

# Parse the HTML code (the "lxml" parser needs the lxml package: pip install lxml)
soup = BeautifulSoup(html_object, "lxml")

# Iterate over every tag
for x in soup.find_all():
    # Fetch the text of the tag with surrounding whitespace stripped
    if len(x.get_text(strip=True)) == 0:
        # Remove the empty tag
        x.extract()

# Print the HTML code with the empty tags removed
print(soup)
Output:
<html><body><div><strong>sometexthere</strong></div></body></html>
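One consequence of the length check is that any tag without text counts as empty, including void elements such as <br> or <img> that never contain text. If those should survive, a variant along the following lines may help. It is only a sketch under stated assumptions: the allow-list KEEP_TAGS and the function name remove_empty_tags_except are hypothetical, and the built-in "html.parser" is used to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical allow-list of tags to keep even when they carry no text
KEEP_TAGS = frozenset({"img", "br", "hr", "input"})


def remove_empty_tags_except(html, keep=KEEP_TAGS, parser="html.parser"):
    """Remove textless tags unless they are, or contain, a tag named in keep."""
    soup = BeautifulSoup(html, parser)
    for tag in soup.find_all():
        # Tags on the allow-list are never removed
        if tag.name in keep:
            continue
        # A textless wrapper survives if a kept tag sits somewhere inside it
        if any(child.name in keep for child in tag.find_all()):
            continue
        # Same check as before: no text left after stripping whitespace
        if len(tag.get_text(strip=True)) == 0:
            tag.extract()
    return str(soup)


sample = '<div><p></p><a href="#"><img src="logo.png"/></a><span> </span>text</div>'
result = remove_empty_tags_except(sample)
print(result)
```

Here the empty <p> and the whitespace-only <span> are dropped, while the <img> and the link wrapped around it are kept.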
Example 2: Remove empty tags from a given URL.
Python3
# Import modules
from bs4 import BeautifulSoup
import requests

# Page URL
URL = "https://www.geeksforgeeks.org/"

# Fetch the page content from the URL
page = requests.get(URL)

# Parse the HTML code (the "lxml" parser needs the lxml package: pip install lxml)
soup = BeautifulSoup(page.content, "lxml")

# Iterate over every tag
for x in soup.find_all():
    # Fetch the text of the tag with surrounding whitespace stripped
    if len(x.get_text(strip=True)) == 0:
        # Remove the empty tag
        x.extract()

# Print the HTML code with the empty tags removed
print(soup)
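Output: the HTML of the page with its empty tags removed. The exact text depends on the live page content.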