使用 BeautifulSoup 获取所有 HTML 标签

网页抓取是使用机器人（如称为网页抓取工具的软件）从 HTML 或 XML 内容中提取信息的过程。 Beautiful Soup 就是这样一个用于通过Python抓取数据的库。 Beautiful Soup 解析网页的 HTML 内容并收集它以提供迭代、搜索和修改功能。为了提供这些功能，它与将内容转换为解析树的解析器一起工作。使用您熟悉的解析器使用 BeautifulSoup 抓取网页相当容易。

要使用 BeautifulSoup 库获取网页的所有 HTML 标签，首先导入 BeautifulSoup 和 requests 库以向网页发出GET 请求。

循序渐进的方法：

导入所需的模块。

Python3

from bs4 import BeautifulSoup
import requests

Python3

# Assign URL
url = "https://www.geeksforgeeks.org/"
  
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

Python3

# Parse the html content using any parser 
soup = BeautifulSoup(html_content,"html.parser")

Python3

[tag.name for tag in soup.find_all()]

Python3

# Import modules
from bs4 import BeautifulSoup
import requests
  
# Assign URL
url = "https://www.geeksforgeeks.org/"
  
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
  
# Parse the html content using any parser
soup = BeautifulSoup(html_content, "html.parser")
  
# Display HTML tags
[tag.name for tag in soup.find_all()]

导入库后，现在使用网页的 URL 分配一个 URL 变量，并发出 GET 请求以获取原始 HTML 内容：

蟒蛇3

# Assign URL
url = "https://www.geeksforgeeks.org/"
  
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

现在解析 HTML 内容：

蟒蛇3

# Parse the html content using any parser 
soup = BeautifulSoup(html_content,"html.parser")

现在要获取网页的所有 HTML 标签，请使用 find_all()函数为标签的.name属性运行一个循环：

蟒蛇3

[tag.name for tag in soup.find_all()]

下面是完整的程序：

蟒蛇3

# Import modules
from bs4 import BeautifulSoup
import requests
  
# Assign URL
url = "https://www.geeksforgeeks.org/"
  
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
  
# Parse the html content using any parser
soup = BeautifulSoup(html_content, "html.parser")
  
# Display HTML tags
[tag.name for tag in soup.find_all()]

输出：

['html',
 'head',
 'meta',
 'meta',
 'meta',
 'link',
 'meta',
 'meta',
 'meta',
 'meta',
 'meta',
 'script',
 'script',
 'link',
 'title',
 'link',
 'link',
 'script',
 'script']