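Remove all styles, scripts, and HTML tags using BeautifulSoup

Prerequisites: BeautifulSoup, Requests

Beautiful Soup is a Python library for extracting data from HTML and XML files. This article discusses how to use Beautiful Soup to remove all styles, scripts, and HTML tags from a document, leaving only its text.

Modules needed:

bs4: Beautiful Soup (bs4) is a Python library used mainly for extracting data from HTML, XML, and other markup languages. It is one of the most widely used web scraping libraries.
Run the following command in the terminal to install this library: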

pip install bs4

requests: This library is used for making HTTP requests in Python.
Run the following command in the terminal to install this library:

pip install requests

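Approach:

Import the bs4 library
Create an HTML document
Parse the content into a BeautifulSoup object
Iterate over the style and script tags and remove each one from the document with the decompose() method
Retrieve the remaining tag content with the stripped_strings generator
Print the extracted data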
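
Before the full implementation, the two key calls can be looked at in isolation. The following is a minimal sketch, assuming a small made-up HTML string (the markup and variable names are hypothetical, not from the article). It shows that decompose() drops a tag together with its contents, and that stripped_strings is a generator attribute rather than a method, so it is used without parentheses.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration
html = (
    "<div><style>p {color: red;}</style>"
    "<p> Hello </p><script>var x = 1;</script><p>world</p></div>"
)

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...)
for tag in soup(["style", "script"]):
    tag.decompose()  # removes the tag and everything inside it

# stripped_strings yields each text node with surrounding whitespace removed
strings = list(soup.stripped_strings)
print(strings)
print(" ".join(strings))
```

Collecting the generator into a list first makes it easy to inspect the individual text fragments before joining them.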

Implementation:

Python3

# Import Module
from bs4 import BeautifulSoup
  
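# HTML Document (the style and script bodies are placeholder content)
HTML_DOC = """
              <html>
                <head>
                    <title> Geeksforgeeks </title>
                    <style>.call {background-color: black;}</style>
                    <script>getit</script>
                </head>
                <body>
                    is a
                    <div>Computer Science portal.</div>
                </body>
              </html>
            """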
  
# Function to remove tags
def remove_tags(html):
  
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
  
    for data in soup(['style', 'script']):
        # Remove tags
        data.decompose()
  
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)
  
  
# Print the extracted data
print(remove_tags(HTML_DOC))

Output:

Geeksforgeeks is a Computer Science portal.
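
As a possible variation, the join over stripped_strings can be replaced by Beautiful Soup's get_text() with the separator and strip arguments, which should give the same result. The sketch below is an assumption-labelled alternative, not the article's method; the function name remove_tags_get_text and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def remove_tags_get_text(html):
    """Variant of remove_tags that uses get_text() instead of a manual join."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop style and script tags together with their contents
    for data in soup(["style", "script"]):
        data.decompose()

    # strip=True trims each text fragment; separator joins the fragments
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample markup, made up for this illustration
sample = (
    "<html><head><title> Title </title><style>b {}</style></head>"
    "<body>some <b>bold</b> text</body></html>"
)
result = remove_tags_get_text(sample)
print(result)
```

Either form works; the explicit join keeps the individual fragments available, while get_text() is shorter when only the final string is needed.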

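Removing all styles, scripts, and HTML tags from a URL

Approach:

Import the bs4 and requests libraries
Fetch the content of the given URL with requests.get()
Parse the content into a BeautifulSoup object
Iterate over the style and script tags and remove each one from the document with the decompose() method
Retrieve the remaining tag content with the stripped_strings generator
Print the extracted data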
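
One detail worth noting for this approach is that the content attribute of a requests response holds raw bytes rather than a str, and BeautifulSoup accepts bytes directly. The following is a minimal offline sketch, assuming a hand-written byte string (the hypothetical fake_content) standing in for a real response body, so that it runs without network access.

```python
from bs4 import BeautifulSoup


def remove_tags(html):
    """Return the text of an HTML document without style and script content."""
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(["style", "script"]):
        data.decompose()
    return " ".join(soup.stripped_strings)


# Hypothetical stand-in for page.content: raw UTF-8 bytes, no network involved
fake_content = (
    '<html><head><meta charset="utf-8"><script>track();</script></head>'
    "<body><h1>Data Structures</h1><p>Arrays, lists and trees.</p></body></html>"
).encode("utf-8")

text = remove_tags(fake_content)
print(text)
```

Because the same function handles both str and bytes input, the URL version needs no decoding step before parsing.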

Implementation:

Python3

# Import Module
from bs4 import BeautifulSoup
import requests
  
# Website URL
URL = 'https://www.geeksforgeeks.org/data-structures/'
  
# Page content from Website URL
page = requests.get(URL)
  
# Function to remove tags
def remove_tags(html):
  
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
  
    for data in soup(['style', 'script']):
        # Remove tags
        data.decompose()
  
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)
  
  
# Print the extracted data
print(remove_tags(page.content))

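Output: the text of the page with all tags, styles, and scripts removed. The exact text depends on the page's current content.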