📜  Navigating with BeautifulSoup

📅  Last modified: 2022-05-13 01:55:19.423000             🧑  Author: Mango


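BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree for the parsed page, which makes it useful for web scraping: it extracts data from HTML and XML files and works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.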

Installation

This module does not come built in with Python. To install it, type the following command in a terminal.

pip install bs4

Navigating with BeautifulSoup

The snippet below is the HTML document we will use as a reference for navigating with BeautifulSoup tags.

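Python3
from bs4 import BeautifulSoup

# Tags reconstructed from the outputs shown later in this article;
# the original attributes (such as href) were lost and are omitted.
ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p><b>most viewed courses in GFG,its all free</b></p>
<p>Top 5 Popular Programming Languages</p>
<a>Java</a>
<a>c/c++</a>
<a>Python</a>
<a>Javascript</a>
<a>Ruby</a>
according to an online survey.
<p> Programming Languages</p>
</body></html>"""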



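Now let's navigate the snippet above in every possible way using BeautifulSoup in Python. The most important components of an HTML document are its tags, and a tag may contain other tags or strings (the tag's children). BeautifulSoup provides several ways to iterate over these children; let's look at each of them.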



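Navigating down

Navigating by tag name:

Example 1: Getting the head tag.

Use .head on the BeautifulSoup object to get the head tag of the HTML document.

Example 2: Getting the title tag.

Use .title to retrieve the title of the HTML document held in the BeautifulSoup variable.

Code:

Python3

from bs4 import BeautifulSoup

# ht_doc is the HTML document defined earlier
soup = BeautifulSoup(ht_doc, 'html.parser')
print(soup.head)
print(soup.title)


Output:

<head><title>Geeks For Geeks</title></head>
<title>Geeks For Geeks</title>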

Example 3: Getting a specific tag.

We can retrieve a specific tag, such as the first <b> tag inside the body tag.

Using a tag name as an attribute returns only the first tag with that name.

With find_all, we get every tag with that name.

Code:

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
 
# retrieve the first b tag inside the body tag
print(soup.body.b)
 
# retrieve the first a tag in the document
print(soup.a)
 
# retrieve every a tag in ht_doc
print(soup.find_all("a"))

Output:
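The difference between attribute access and find_all can be sketched on a small stand-in document. The HTML string below is hypothetical and is not the article's ht_doc; it only illustrates that the attribute form returns the first match (or None when nothing matches), while find_all returns every match.

```python
from bs4 import BeautifulSoup

# Hypothetical three-link document, used only for this illustration.
html = "<body><p><b>bold</b></p><a>Java</a><a>Python</a><a>Ruby</a></body>"
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                  # only the first <a> tag
all_links = soup.find_all("a")       # every <a> tag, in document order
link_texts = [a.get_text() for a in all_links]

print(first_link)    # <a>Java</a>
print(link_texts)    # ['Java', 'Python', 'Ruby']
print(soup.img)      # None, because no <img> tag exists
```

Checking for None before using the result of attribute access avoids an AttributeError when a tag is missing.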

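Example 4: .contents and .children

We can use .contents to get a tag's children as a list. The .children attribute iterates over the same direct children without building a list.

Code:

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assign the head tag to a variable
hTag = soup.head
print(hTag)
 
# retrieve the direct children of the head tag as a list
print(hTag.contents)

Output:

<head><title>Geeks For Geeks</title></head>
[<title>Geeks For Geeks</title>]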
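As a minimal sketch of how the two attributes relate, the snippet below uses a hypothetical two-item list (not the article's ht_doc). It assumes nothing beyond the standard bs4 behaviour that .contents is a list and .children is an iterator over the same elements.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list, used only for this illustration.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul_tag = soup.ul

as_list = ul_tag.contents        # a plain Python list of direct children
as_iterator = ul_tag.children    # an iterator over the same children

print(as_list[0])                # <li>one</li>
child_names = [child.name for child in as_iterator]
print(child_names)               # ['li', 'li']
```

Use .contents when you need indexing or len(), and .children when a single pass over the children is enough.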

Example 5: .descendants

The .descendants attribute lets you iterate over all of a tag's children recursively: its direct children, the children of its direct children, and so on.

Code:

Python3

# parse the HTML document into a BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assign the head tag to a variable
htag = soup.head
 
# iterate over every descendant of the head tag
for child in htag.descendants:
    print(child)

Output:

<title>Geeks For Geeks</title>
Geeks For Geeks
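To make the difference between direct children and descendants concrete, here is a small sketch on a standalone head element. The HTML string is written for this illustration; it mirrors the head tag used above but is not taken from ht_doc.

```python
from bs4 import BeautifulSoup

# Standalone head element, written only for this illustration.
soup = BeautifulSoup("<head><title>Geeks For Geeks</title></head>", "html.parser")
head = soup.head

direct_children = list(head.children)       # just the <title> tag
all_descendants = list(head.descendants)    # the <title> tag and the string inside it

print(len(direct_children))    # 1
print(len(all_descendants))    # 2
print(all_descendants[1])      # Geeks For Geeks
```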

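Example 6: .string

If a tag has only one child, and that child is a NavigableString, the child is available as .string.

However, if a tag contains more than one thing, it is unclear what .string should refer to, so .string is defined to be None. The code below shows a related case: the only child of the head tag is the title tag, so head.string returns the title's string.

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
htag = soup.head
print(htag.string)

Output:

Geeks For Geeks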
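The None case described above can be sketched as follows. The paragraph below is a hypothetical fragment, not part of ht_doc: the b tag has exactly one string child, while the p tag mixes strings and tags.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with mixed content, used only for this illustration.
html = "<p>Top languages: <b>Java</b> and <b>Python</b></p>"
soup = BeautifulSoup(html, "html.parser")

single_child = soup.b.string    # <b> holds exactly one string
mixed_content = soup.p.string   # <p> holds strings and tags, so there is no single string

print(single_child)     # Java
print(mixed_content)    # None
```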

Example 7: .strings and .stripped_strings

If a tag contains more than one thing, you can still look at just the strings by using the .strings generator.

Python3

soup = BeautifulSoup(ht_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))

Output:

'\n'
'Geeks For Geeks'
'\n'
'\n'
'most viewed courses in GFG,its all free'
'\n'
'Top 5 Popular Programming Languages'
'\n'
'Java'
'\n'
'c/c++'
'\n'
'Python'
'\n'
'Javascript'
'\n'
'Ruby'
'\naccording to an online survey. '
'\n'
' Programming Languages'
'\n'

To remove the extra whitespace, use the .stripped_strings generator:

Python3

# parse the HTML document into a BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# iterate over the whitespace-stripped strings of the document
for string in soup.stripped_strings:
    print(repr(string))

Output:

'Geeks For Geeks'
'most viewed courses in GFG,its all free'
'Top 5 Popular Programming Languages'
'Java'
'c/c++'
'Python'
'Javascript'
'Ruby'
'according to an online survey.'
'Programming Languages'
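Because both attributes are generators, a common pattern is to collect them into a list. The sketch below uses a hypothetical fragment (not ht_doc) to show that .stripped_strings trims each string and skips the ones that consist only of whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace, used only for this illustration.
html = "<div>\n <p> Java </p>\n <p>Python</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw_strings = list(soup.strings)               # includes whitespace-only strings
clean_strings = list(soup.stripped_strings)    # trimmed, whitespace-only strings skipped

print(len(raw_strings))    # 5
print(clean_strings)       # ['Java', 'Python']
```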

Navigating up with BeautifulSoup:

To continue the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

Example 1: .parent

The .parent attribute retrieves the parent of an element.

Code:

Python3

from bs4 import BeautifulSoup

# parse the HTML document (ht_doc is the document defined earlier)
soup = BeautifulSoup(ht_doc, 'html.parser')

# assign the title tag to a variable
Itag = soup.title

# print the parent element of the title tag
print(Itag.parent)

# the parent of the html tag is the BeautifulSoup object itself
htmlTag = soup.html
print(type(htmlTag.parent))

# the BeautifulSoup object has no parent
print(soup.parent)

Output:

<head><title>Geeks For Geeks</title></head>
<class 'bs4.BeautifulSoup'>
None
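The same parent chain can be checked step by step. This is a minimal sketch on a standalone document written for the illustration: a string's parent is the tag that contains it, and the top-level tag's parent is the BeautifulSoup object.

```python
from bs4 import BeautifulSoup

# Standalone document, written only for this illustration.
html = "<html><head><title>Geeks For Geeks</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string

print(title_text.parent.name)      # title
print(soup.title.parent.name)      # head
print(soup.html.parent is soup)    # True
print(soup.parent)                 # None
```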

Example 2: .parents

To iterate over all of an element's ancestors, use the .parents attribute:

Python3

# parse the HTML document into a BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')
 
# assign the first a tag to the link variable
link = soup.a
print(link)
 
# walk up from the link, printing the name of each ancestor;
# iteration stops after the BeautifulSoup object, which has no parent
for parent in link.parents:
    print(parent.name)

Output:
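As a sketch of what such a walk produces, the snippet below collects ancestor names for a link in a hypothetical document (not ht_doc). The last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical nested document, used only for this illustration.
html = "<html><body><p><a>Java</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields each ancestor, from the closest one up to the soup object
ancestor_names = [parent.name for parent in soup.a.parents]

print(ancestor_names)    # ['p', 'body', 'html', '[document]']
```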

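Navigating sideways with BeautifulSoup

.next_sibling and .previous_sibling are attributes used to move between page elements that sit at the same level of the parse tree.

Code:

Python3

from bs4 import BeautifulSoup

# Tags reconstructed from the b and c lookups below;
# any further markup in the original snippet was lost.
sibling_soup = BeautifulSoup(
    "<a><b>Geeks For Geeks</b>"
    "<c>The Biggest Online Tutorials Library, It's all Free</c></a>",
    'html.parser')
 
# retrieve the next sibling of the b tag
print(sibling_soup.b.next_sibling)
 
# retrieve the previous sibling of the c tag
print(sibling_soup.c.previous_sibling)

Output:

<c>The Biggest Online Tutorials Library, It's all Free</c>
<b>Geeks For Geeks</b>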
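One detail worth sketching: when tags are separated by whitespace, the immediate sibling is the whitespace string, not the next tag. The fragments below are hypothetical and use made-up b and c tags only to mirror the example above; find_next_sibling is the bs4 method that skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments, used only for this illustration.
tight = BeautifulSoup("<a><b>first</b><c>second</c></a>", "html.parser")
spaced = BeautifulSoup("<a><b>first</b>\n<c>second</c></a>", "html.parser")

print(tight.b.next_sibling)                 # <c>second</c>
print(tight.b.previous_sibling)             # None, because <b> is the first child

# With a newline between the tags, the next sibling is that newline string.
print(repr(spaced.b.next_sibling))          # '\n'

# find_next_sibling skips over strings and returns the next matching tag.
print(spaced.b.find_next_sibling("c"))      # <c>second</c>
```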