Extract all the URLs that are nested within <li> tags using BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. In this article, we will see how to extract all the URLs that are nested within <li> tags on a webpage.
Modules needed and installation:
- bs4 (BeautifulSoup): our main module; it parses the HTML we download and lets us search it for tags.
pip install bs4
- requests: used to perform a GET request on the webpage and fetch its content.
Note: requests may already be present in your environment; if it is missing, install it manually with the command below.
pip install requests
Approach
- First, we will import the required libraries.
- We will perform a GET request on the desired webpage, the one we want to collect all the URLs from.
- We will pass the response text to the BeautifulSoup constructor and convert it into a soup object.
- Using a for loop, we will find all the <li> tags in the webpage.
- If an <li> tag contains an anchor tag, we will look for the href attribute and store its value in a list; this is the URL we are looking for.
- Print the list containing all the URLs. (A compact alternative using CSS selectors is sketched right after this list.)
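As an aside, not part of the original walkthrough: the same result can usually be obtained more compactly with a CSS selector. A minimal, self-contained sketch, assuming the same article URL used in the steps below:
Python3
# A compact alternative (a sketch, not the article's method):
# the CSS selector 'li a[href]' matches every <a> that carries
# an href attribute and is nested inside an <li>.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
soup = BeautifulSoup(requests.get(URL).text, 'html.parser')

urls = [a['href'] for a in soup.select('li a[href]')]
print(urls)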
Let's walk through the code and see what happens at each important step.
Step 1: Initialize the Python program by importing all the required libraries and setting the URL of the webpage from which you want all the URLs contained in anchor tags.
In the example below, we will use another GeeksforGeeks article on implementing web scraping with BeautifulSoup, and extract the URLs stored in anchor tags nested within <li> tags.
Article link: https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
Python3
# Importing libraries
import requests
from bs4 import BeautifulSoup
# setting up the URL
URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
Step 2: We will perform a GET request on the desired URL, pass all the text we receive to BeautifulSoup, and convert it into a soup object. We set the parser to html.parser; you can choose a different parser depending on the webpage you are scraping.
Python3
# perform get request to the url
reqs = requests.get(URL)
# extract all the text that you received
# from the GET request
content = reqs.text
# convert the text to a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')
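A small hardening aside, not part of the original walkthrough: it is worth confirming that the request actually succeeded before parsing, and a timeout keeps the script from hanging. html.parser is built in; lxml (installed separately) is a common faster alternative. A hedged sketch:
Python3
# Optional hardening around the same GET request (a sketch).
reqs = requests.get(URL, timeout=10)  # fail instead of hanging forever
reqs.raise_for_status()               # error out on 4xx/5xx responses

# 'lxml' (pip install lxml) can replace 'html.parser' if installed
soup = BeautifulSoup(reqs.text, 'html.parser')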
Step 3: Create an empty list to store all the URLs that you will receive as the desired output. Run a for loop that iterates over all the <li> tags in the webpage.
Python3
# Empty list to store the output
urls = []

# For loop that iterates over all the <li> tags
for h in soup.findAll('li'):
    # looking for anchor tag inside the <li> tag
    a = h.find('a')
    try:
        # looking for href inside the anchor tag
        if 'href' in a.attrs:
            # storing the value of href in a separate
            # variable
            url = a.get('href')
            # appending the url to the output list
            urls.append(url)
    # if the <li> tag has no anchor tag, or the anchor
    # tag has no href attribute, we pass
    except:
        pass
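A variant worth noting, not from the original article: the try/except above only swallows the AttributeError raised when find('a') returns None. A sketch of the same loop without the bare except:
Python3
# Same logic without try/except (find_all is the modern
# name for findAll).
urls = []
for h in soup.find_all('li'):
    a = h.find('a')
    # has_attr() avoids touching .attrs on a None result
    if a is not None and a.has_attr('href'):
        urls.append(a['href'])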
Step 4: Print the output by iterating over the list of URLs.
Python3
# print all the urls stored in the urls list
for url in urls:
    print(url)
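One more optional touch, not in the original: href values are often relative (for example '/about'). The standard library can resolve them against the page URL before printing; a hedged sketch:
Python3
from urllib.parse import urljoin

# Resolve each (possibly relative) href against the page URL.
for url in urls:
    print(urljoin(URL, url))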
Complete code:
Python3
# Importing libraries
import requests
from bs4 import BeautifulSoup

# setting up the URL
URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'

# perform get request to the url
reqs = requests.get(URL)

# extract all the text that you received from
# the GET request
content = reqs.text

# convert the text to a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')

# Empty list to store the output
urls = []

# For loop that iterates over all the <li> tags
for h in soup.findAll('li'):
    # looking for anchor tag inside the <li> tag
    a = h.find('a')
    try:
        # looking for href inside the anchor tag
        if 'href' in a.attrs:
            # storing the value of href in a separate variable
            url = a.get('href')
            # appending the url to the output list
            urls.append(url)
    # if the <li> tag has no anchor tag, or the anchor tag
    # has no href attribute, we pass
    except:
        pass

# print all the urls stored in the urls list
for url in urls:
    print(url)
Output:
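The output is simply every href found inside an <li> tag on the page, and the same link can appear several times. If duplicates are unwanted, an order-preserving dedup (an optional addition, not in the original) is one line:
Python3
# Optional: drop duplicate URLs while keeping first-seen order.
unique_urls = list(dict.fromkeys(urls))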