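Navigation with BeautifulSoup
BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree for each parsed page, which makes it useful for web scraping: it extracts data from HTML and XML files and works with your parser of choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.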
Installation
This module is not built into Python. To install it, run the following command in a terminal.
pip install bs4
Navigating with BeautifulSoup
The snippet below is the HTML document used as a reference throughout this article to demonstrate navigation with BeautifulSoup.
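Python3
from bs4 import BeautifulSoup

# The markup is a reconstruction: only the text survived, so the tag layout
# is an assumption chosen to be consistent with the outputs shown later, and
# the original attributes (href, class, id) are omitted.
ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p><b>most viewed courses in GFG,its all free</b></p>
<p>Top 5 Popular Programming Languages</p>
<p><a>Java</a>
<a>c/c++</a>
<a>Python</a>
<a>Javascript</a>
<a>Ruby</a>
according to an online survey. </p>
<p> Programming Languages</p>
"""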
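Now let's navigate the snippet above in every available way using BeautifulSoup. The most important components of an HTML document are its tags, which may contain other tags or strings (the tag's children). BeautifulSoup provides several ways to iterate over these children; let's look at each of them.
Navigating Down
Navigating using tag names:
Example 1: Getting the head tag.
Use .head on the BeautifulSoup object to get the head tag of the HTML document.
Syntax : (BeautifulSoup Variable).head
Example 2: Getting the title tag.
Use .title to retrieve the title of the HTML document held in the BeautifulSoup variable.
Syntax : (BeautifulSoup Variable).title
Code:
Python3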
soup = BeautifulSoup(ht_doc, 'html.parser')
print(soup.head)
print(soup.title)
Output:
<head><title>Geeks For Geeks</title></head>
<title>Geeks For Geeks</title>
Example 3: Getting a specific tag.
We can retrieve a specific tag, such as the first <b> tag inside the body tag.
Syntax : (BeautifulSoup Variable).body.b
Using a tag name as an attribute gives you only the first tag with that name.
Syntax: (BeautifulSoup Variable).(tag name)
With find_all, we can get every tag with that name.
Syntax: (BeautifulSoup Variable).find_all(tag name)
Code:
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
# retrieving the first b tag inside body
print(soup.body.b)
# retrieving the first a tag in the document
print(soup.a)
# retrieving all a tags in ht_doc
print(soup.find_all("a"))
Output:
Example 4: .contents and .children
We can use .contents to get a tag's children as a list. The .children generator iterates over the same children one at a time.
Syntax: (BeautifulSoup Variable).contents
Code:
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning the head tag to a variable
hTag = soup.head
print(hTag)
# retrieving the children of the head tag as a list
print(hTag.contents)
Output:
<head><title>Geeks For Geeks</title></head>
[<title>Geeks For Geeks</title>]
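The example above exercises only .contents. As a minimal sketch of .children, assuming a small hypothetical list snippet (not the article's ht_doc), the iterator visits the same direct children that .contents returns as a list:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
snippet = "<ul><li>Java</li><li>Python</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")
ul = soup.ul

# .children is an iterator over the direct children only.
child_names = [child.name for child in ul.children]
print(child_names)

# .contents holds the same children as a list, so it supports len() and indexing.
contents = ul.contents
print(len(contents), contents[0])
```

Prefer .children when you only need to loop once, and .contents when you need indexing or a length.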
Example 5: .descendants
The .descendants attribute lets you iterate over all of a tag's children recursively: its direct children, the children of those children, and so on.
Syntax: (Variable assigned from BeautifulSoup Variable).descendants
Code:
Python3
# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning the head tag to a variable
htag = soup.head
# iterating over every descendant of the head tag
for child in htag.descendants:
    print(child)
Output:
<title>Geeks For Geeks</title>
Geeks For Geeks
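Example 6: .string
If a tag has only one child, and that child is a NavigableString, the child is available as .string. The same holds when a tag's only child is another tag that itself has a .string, which is why head.string below returns the title text.
However, if a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None. The code below shows the single-child case.
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
htag = soup.head
print(htag.string)
Output:
Geeks For Geeks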
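The None case is not exercised by the example above. A minimal sketch, assuming a hypothetical one-line snippet (not the article's ht_doc), contrasts a tag with a single string child against a tag with several children:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
snippet = "<p><b>Java</b> and <i>Python</i></p>"
soup = BeautifulSoup(snippet, "html.parser")

# <b> has exactly one child, a string, so .string returns it.
single = soup.b.string
print(repr(single))

# <p> has three children (<b>, a string, <i>), so .string is None.
multiple = soup.p.string
print(multiple)
```

When .string is None, the .strings and .stripped_strings generators are the way to reach the text.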
Example 7: .strings and stripped_strings
If a tag contains more than one thing, you can still look at just the strings by using the .strings generator.
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))
Output:
'\n'
'Geeks For Geeks'
'\n'
'\n'
'most viewed courses in GFG,its all free'
'\n'
'Top 5 Popular Programming Languages'
'\n'
'Java'
'\n'
'c/c++'
'\n'
'Python'
'\n'
'Javascript'
'\n'
'Ruby'
'\naccording to an online survey. '
'\n'
' Programming Languages'
'\n'
To remove the extra whitespace, use the .stripped_strings generator:
Python3
# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')
# iterating over the whitespace-stripped strings of the document;
# strings that consist only of whitespace are skipped
for string in soup.stripped_strings:
    print(repr(string))
Output:
'Geeks For Geeks'
'most viewed courses in GFG,its all free'
'Top 5 Popular Programming Languages'
'Java'
'c/c++'
'Python'
'Javascript'
'Ruby'
'according to an online survey.'
'Programming Languages'
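Navigating up with BeautifulSoup:
Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.
Example 1: .parent
The .parent attribute retrieves an element's parent element.
Syntax : (BeautifulSoup Variable).parent
Code:
Python3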
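# The markup is a reconstruction: only the text survived, so the tag layout
# is an assumption and the original attributes (href, class, id) are omitted.
ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p><b>most viewed courses in GFG,its all free</b></p>
<p>Top 5 Popular Programming Languages</p>
<p><a>Java</a>
<a>c/c++</a>
<a>Python</a>
<a>Javascript</a>
<a>Ruby</a>
according to an online survey. </p>
<p> Programming Languages</p>
"""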
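from bs4 import BeautifulSoup
# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning the title tag to a variable
Itag = soup.title
# printing the parent element of the title tag
print(Itag.parent)
# the parent of the top-level html tag is the BeautifulSoup object itself
htmlTag = soup.html
print(type(htmlTag.parent))
# the BeautifulSoup object has no parent
print(soup.parent)
Output:
<head><title>Geeks For Geeks</title></head>
<class 'bs4.BeautifulSoup'>
None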
Example 2: .parents
To iterate over all of an element's ancestors, use the .parents attribute:
Syntax : (BeautifulSoup Variable).parents
Python3
# parsing the HTML document into a BeautifulSoup object
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning the first a tag to the link variable
link = soup.a
print(link)
# iterating over every ancestor of the link
for parent in link.parents:
    # guard for the case where a parent is missing
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
Navigating sideways with BeautifulSoup
.next_sibling and .previous_sibling are attributes used to navigate between page elements on the same level of the parse tree.
Syntax:
(BeautifulSoup Variable).(tag name).next_sibling
(BeautifulSoup Variable).(tag name).previous_sibling
Code:
Python3
from bs4 import BeautifulSoup
# The b and c tags are reconstructed from the calls below (an assumption);
# any further markup in the original string is not recoverable.
sibling_soup = BeautifulSoup(
    "<b>Geeks For Geeks</b>"
    "<c>The Biggest Online Tutorials Library, It's all Free</c>",
    'html.parser')
# to retrieve next sibling of b tag
print(sibling_soup.b.next_sibling)
# for retrieving previous sibling of c tag
print(sibling_soup.c.previous_sibling)
Output:
<c>The Biggest Online Tutorials Library, It's all Free</c>
<b>Geeks For Geeks</b>
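In real documents, tags are usually separated by whitespace, so .next_sibling often returns a newline string rather than the next tag. A minimal sketch, assuming a hypothetical list snippet (not the article's ht_doc), shows this and uses find_next_sibling to skip straight to the next matching tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with newlines between the tags.
snippet = "<ul>\n<li>Java</li>\n<li>Python</li>\n</ul>"
soup = BeautifulSoup(snippet, "html.parser")

first = soup.li

# The immediate sibling is the newline text between the two <li> tags.
whitespace = first.next_sibling
print(repr(whitespace))

# find_next_sibling skips text nodes and returns the next <li> tag.
second = first.find_next_sibling("li")
print(second)
```

The same applies in the other direction with .previous_sibling and find_previous_sibling.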