使用 BeautifulSoup 检索 html 标签的孩子
先决条件: Beautifulsoup
Beautifulsoup 是一个用于网页抓取的Python模块。本文讨论如何抓取和显示给定 HTML 标签的子标签。
示例网站: https : //www.geeksforgeeks.org/caching-page-tables/
对于第一个孩子:
方法
- 导入模块
- 传递网址
- 请求页面
- 使用 findChild()函数显示第一个孩子
句法:
findChild()
例子:
Python3
from bs4 import BeautifulSoup
import requests
# sample web page
sample_web_page = 'https://www.geeksforgeeks.org/caching-page-tables/'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find('p')
print("child : ", child_soup.findChild())
Python3
from bs4 import BeautifulSoup
import requests
# sample web page
sample_web_page = 'https://www.geeksforgeeks.org/caching-page-tables/'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find('p')
for i in child_soup.children:
print("child : ", i)
Python3
from bs4 import BeautifulSoup
import requests
# sample web page
sample_web_page = 'https://www.geeksforgeeks.org/caching-page-tables/'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find('p')
print("child : ", child_soup.contents)
输出:
对于所有孩子:
对于检索 HTML 标记的子项,我们必须选择使用.children或.contents 。孩子和内容之间的区别是孩子不占用任何内存,它为我们提供了一个可迭代列表,内容给了孩子标签,但它使用了内存。对于大型 HTML 文件,使用 children 是更好的选择,并且对于存储价值需要的内容会更好。
方法
- 导入模块
- 传递网站网址
- 请求页面
- 使用任一关键字显示子标签
使用 .children:
For Retrieve all the children .children 将使用。
例子:
蟒蛇3
from bs4 import BeautifulSoup
import requests
# sample web page
sample_web_page = 'https://www.geeksforgeeks.org/caching-page-tables/'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find('p')
for i in child_soup.children:
print("child : ", i)
输出:
child : Paging
child : is a memory management scheme which allows physical address space of a process to be non-contiguous. The basic idea of paging is to break physical memory into fixed-size blocks called
child : frames
child : and logical memory into blocks of same size called
child : pages
child : . While executing the process the required pages of that process are loaded into available frames from their source that is disc or any backup storage device.
使用 .contents
它还将返回所有子标签并将它们存储在内存中。
例子
蟒蛇3
from bs4 import BeautifulSoup
import requests
# sample web page
sample_web_page = 'https://www.geeksforgeeks.org/caching-page-tables/'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of beautifulSoup and html parser create soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find('p')
print("child : ", child_soup.contents)
输出
child : [Paging, ‘ is a memory management scheme which allows physical address space of a process to be non-contiguous. The basic idea of paging is to break physical memory into fixed-size blocks called ‘, frames, ‘ and logical memory into blocks of same size called ‘, pages, ‘. While executing the process the required pages of that process are loaded into available frames from their source that is disc or any backup storage device.’]