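Find the Text of a Given Tag Using BeautifulSoup
Web scraping is the process of extracting information from the HTML or XML content of web pages using software bots called web scrapers. Beautiful Soup is a Python library for scraping data. It works with a parser to provide ways of iterating over, searching, and modifying the content the parser produces, which takes the form of a parse tree. Scraping a web page and finding the text of a given tag is fairly easy with Beautiful Soup.
This article discusses how to find the text of a given tag.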
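To make the idea of iterating over, searching, and modifying a parse tree concrete, here is a minimal offline sketch. The HTML snippet and the variable names (`html_doc`, `tag_names`, `intro`) are made up for illustration and are not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Heading</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")

# Iterate: collect the name of every tag in document order.
tag_names = [tag.name for tag in soup.find_all(True)]

# Search: find the first <p> tag whose class is "intro".
intro = soup.find("p", class_="intro")

# Modify: replace the text of that tag in place.
intro.string = "Edited paragraph."

print(tag_names)
print(intro.get_text())
```

Because the snippet is parsed from a string, it runs without any network access, which makes it a convenient way to experiment before pointing the same calls at a real page.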
Step-by-step approach:
- First, import the libraries.
Python3
from bs4 import BeautifulSoup
import requests
- Now assign the URL.
Python3
# assign URL
url = "https://www.geeksforgeeks.org/"
- Fetch the raw HTML content from the URL.
Python3
html_content = requests.get(url).text
- Now parse the content.
Python3
# Now that the content is ready, parse it
# into a searchable tree using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
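- Once the content is parsed, search for a specific tag and print it. Note that find() returns the whole element, so this prints the tag with its markup; appending .text, as the complete program does, gives only the text.
Python3
print(soup.find('title'))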
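The difference between printing a tag and printing its text can be seen in a small offline sketch. The HTML string and the variable names here are assumptions for illustration; the calls themselves (`find`, `.text`, `get_text`) are the ones used in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
html_doc = "<html><head><title> Sample page </title></head><body></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.find("title")

# The tag itself renders with its markup.
tag_markup = str(title_tag)

# .text returns only the text, including surrounding whitespace.
raw_text = title_tag.text

# get_text(strip=True) also trims the whitespace.
clean_text = title_tag.get_text(strip=True)

# find() returns None when the tag does not exist, so check before
# reading the text to avoid an AttributeError.
missing = soup.find("h1")
h1_text = missing.get_text() if missing is not None else None

print(tag_markup)
print(repr(raw_text))
print(clean_text)
print(h1_text)
```

The `None` check matters on real pages, where a tag you expect may simply be absent.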
Below is the complete program.
Python3
from bs4 import BeautifulSoup
import requests
# Assign URL
url = "https://www.geeksforgeeks.org/"
# Fetch raw HTML content
html_content = requests.get(url).text
# Now that the content is ready, parse it
# using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
# Find the first <title> tag and print its text
print(soup.find('title').text)
Output:
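The program above assumes that the request succeeds and that the page contains a title tag. A slightly more defensive variant might look like the sketch below. The helper names `fetch_html` and `title_text` and the 10-second timeout are hypothetical choices for illustration, not part of the original program.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML (hypothetical helper)."""
    # A timeout keeps the call from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return response.text


def title_text(html: str) -> Optional[str]:
    """Return the text of the first <title> tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    return title.get_text(strip=True) if title is not None else None
```

With these helpers, the complete program reduces to `print(title_text(fetch_html(url)))`. Splitting fetching from parsing also lets the parsing step be tried on a plain string without touching the network.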
Similarly, to get all occurrences of a given tag, use find_all():
Python3
from bs4 import BeautifulSoup
import requests
# Assign URL
url = "https://www.geeksforgeeks.org/"
# Fetch raw HTML content
html_content = requests.get(url).text
# Now that the content is ready, parse it
# using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
# Get all occurrences of a given tag
# and print the text of each one
texts = soup.find_all('p')
for text in texts:
    print(text.get_text())
Output:
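Since the result of the program above depends on whatever the live page contains at the time, a small offline sketch may make the behaviour of find_all() easier to see. The HTML snippet and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one paragraph with inline markup, one nested
# inside a <div>, and one empty paragraph.
html_doc = """
<body>
  <p>First <b>bold</b> paragraph.</p>
  <div><p>Nested paragraph.</p></div>
  <p></p>
</body>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all() searches the whole tree, so the nested <p> is found too.
paragraphs = soup.find_all("p")

# get_text() joins the text of child tags such as <b>; strip() trims
# the ends of each result.
texts = [p.get_text().strip() for p in paragraphs]

# Drop paragraphs that contain no text.
non_empty = [t for t in texts if t]

for t in non_empty:
    print(t)
```

Filtering out empty strings is optional, but real pages often contain empty paragraph tags that would otherwise print as blank lines.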