Beautiful Soup - Kinds of Objects

📅 Last modified: 2020-11-09 14:26:20 🧑 Author: Mango

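When we pass an HTML document or string to the BeautifulSoup constructor, Beautiful Soup converts the complex HTML page into a tree of Python objects. Below we discuss the four main kinds of objects: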

Tag
NavigableString
BeautifulSoup
Comment
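
As a rough illustration of how the four kinds relate, the sketch below parses a tiny document and labels every node. The sample markup, the helper name `kind`, and the choice of the built-in "html.parser" are assumptions made for this example only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, chosen only for illustration.
html = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")


def kind(node):
    """Return the name of the most specific of the four kinds for a node."""
    # Order matters: BeautifulSoup subclasses Tag, and Comment subclasses
    # NavigableString, so the subclasses are checked first.
    for cls in (BeautifulSoup, Tag, Comment, NavigableString):
        if isinstance(node, cls):
            return cls.__name__
    return type(node).__name__


# The document object itself, followed by every node beneath it.
kinds = [kind(soup)] + [kind(node) for node in soup.descendants]
print(kinds)  # ['BeautifulSoup', 'Tag', 'NavigableString', 'Comment']
```

The subclasses are tested before their parents because a BeautifulSoup object is also a Tag and a Comment is also a NavigableString.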

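Tag Objects

HTML tags are used to define various types of content. A Tag object in Beautiful Soup corresponds to an HTML or XML tag in the actual page or document.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml')
>>> tag = soup.html
>>> type(tag)
<class 'bs4.element.Tag'>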

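A tag has many attributes and methods. Its two most important features are its name and its attributes.

Name (tag.name)

Every tag has a name, which is accessible through the ".name" suffix. tag.name returns the type of tag it is.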

>>> tag.name
'html'

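If we change a tag's name, the change is reflected in the HTML markup generated by Beautiful Soup.

>>> tag.name = "Strong"
>>> tag
<Strong><body><b class="boldest">TutorialsPoint</b></body></Strong>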
>>> tag.name
'Strong'
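
The renaming behaviour can also be checked as a standalone script. This is a minimal sketch: the markup is a hypothetical sample, and it uses the built-in "html.parser", which does not add an <html> wrapper, so the tag is fetched with soup.b rather than soup.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; "html.parser" ships with Python.
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', "html.parser")
tag = soup.b

original_name = tag.name      # the name as parsed from the markup
tag.name = "strong"           # rename the element in place
renamed_markup = str(tag)     # serialize the tag back to markup

print(original_name)   # b
print(renamed_markup)  # <strong class="boldest">TutorialsPoint</strong>
```

Only the name changes; the attributes and the text inside the tag are carried over untouched.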

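Attributes (tag.attrs)

A tag object can have any number of attributes. For example, a tag may have an attribute "class" whose value is "boldest". Everything in a tag other than its name is an attribute, and each attribute has a value. You can read an attribute by treating the tag like a dictionary and indexing it by key (for example "class"), or get the whole dictionary directly through ".attrs".

>>> tutorialsP = BeautifulSoup("<div class='tutorialsP'></div>", 'lxml')
>>> tag2 = tutorialsP.div
>>> tag2['class']
['tutorialsP']

We can make all kinds of changes to a tag's attributes (add, remove, modify).

>>> tag2['class'] = 'Online-Learning'
>>> tag2['style'] = '2007'
>>> tag2
<div class="Online-Learning" style="2007"></div>
>>> del tag2['style']
>>> tag2
<div class="Online-Learning"></div>
>>> tag2['class']
'Online-Learning'
>>> tag2['style']
Traceback (most recent call last):
  ...
KeyError: 'style'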
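
To avoid the KeyError when an attribute may be absent, one option is to read it with .get() or test for it with .has_attr(). The sketch below assumes a hypothetical <div> and an illustrative "id" value, and uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<div class="tutorialsP"></div>', "html.parser")
div = soup.div

attrs_before = dict(div.attrs)     # copy of the attribute dictionary
div["id"] = "main"                 # add an attribute (illustrative value)
div["class"] = "Online-Learning"   # modify an existing attribute
del div["id"]                      # remove an attribute

missing = div.get("style")         # None instead of a KeyError
has_class = div.has_attr("class")  # True

print(attrs_before)  # {'class': ['tutorialsP']}
print(div.attrs)     # {'class': 'Online-Learning'}
print(missing, has_class)
```

Note that "class" is parsed as a list but stays a plain string once it is assigned one, because Beautiful Soup stores whatever value it is given.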

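Multi-valued attributes

Some HTML5 attributes can have multiple values. The most common is the class attribute, which can hold several CSS classes. Others include "rel", "rev", "headers", "accesskey" and "accept-charset". Beautiful Soup presents the value of a multi-valued attribute as a list.

>>> from bs4 import BeautifulSoup
>>>
>>> css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
>>> css_soup.p['class']
['body']
>>>
>>> css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
>>> css_soup.p['class']
['body', 'bold']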

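However, if an attribute looks like it contains multiple values but is not defined as multi-valued by any version of the HTML standard, Beautiful Soup leaves the attribute alone:

>>> id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
>>> id_soup.p['id']
'body bold'
>>> type(id_soup.p['id'])
<class 'str'>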

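When a tag is converted back to a string, the values of a multi-valued attribute are joined into one.

>>> rel_soup = BeautifulSoup("<p> tutorialspoint Main <a rel='Index'>Page</a></p>", 'lxml')
>>> rel_soup.a['rel']
['Index']
>>> rel_soup.a['rel'] = ['Index', ' Online Library, Its all Free']
>>> print(rel_soup.p)
<p> tutorialspoint Main <a rel="Index  Online Library, Its all Free">Page</a></p>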

With "get_attribute_list" you always get a list of strings, whether or not the attribute is multi-valued.

>>> id_soup.p.get_attribute_list('id')
['body bold']

However, if you parse the document as "xml", there are no multi-valued attributes:

>>> xml_soup = BeautifulSoup('<p class="body bold"></p>', 'xml')
>>> xml_soup.p['class']
'body bold'
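
The HTML-mode behaviour described in this section can be seen side by side on a single tag. This is a minimal sketch with hypothetical markup, using the built-in "html.parser" so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same text in a multi-valued and a single-valued attribute.
soup = BeautifulSoup('<p class="body bold" id="body bold"></p>', "html.parser")
p = soup.p

class_value = p["class"]                 # multi-valued attribute: a list
id_value = p["id"]                       # single-valued attribute: a string
id_as_list = p.get_attribute_list("id")  # always a list
serialized = str(p)                      # the class list is joined with spaces

print(class_value)  # ['body', 'bold']
print(id_value)     # body bold
print(id_as_list)   # ['body bold']
print(serialized)   # <p class="body bold" id="body bold"></p>
```

get_attribute_list is convenient when the same code has to handle both kinds of attribute without checking the type of the value first.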

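NavigableString

A NavigableString object represents the text content of a tag. To access that content, use ".string" on the tag.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("Hello, Tutorialspoint!", 'lxml')
>>>
>>> soup.string
'Hello, Tutorialspoint!'
>>> type(soup.string)
<class 'bs4.element.NavigableString'>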

You can replace a string with another string, but you cannot edit the existing string in place.

>>> soup = BeautifulSoup("Hello, Tutorialspoint!", 'lxml')
>>> soup.string.replace_with("Online Learning!")
'Hello, Tutorialspoint!'
>>> soup.string
'Online Learning!'
>>> soup
<html><body><p>Online Learning!</p></body></html>
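
Because a NavigableString is a Python string that also remembers its position in the tree, it can be useful to see both sides at once. The sketch below assumes a hypothetical <p> wrapper around the text and the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p>Hello, Tutorialspoint!</p>", "html.parser")
text = soup.p.string

is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)   # NavigableString subclasses str
plain = str(text)                # plain copy with no link to the parse tree
parent_name = text.parent.name   # the string knows which tag contains it

# replace_with() swaps the node in the tree and returns the old string.
old = soup.p.string.replace_with("Online Learning!")
result = str(soup)

print(is_navigable, is_str, parent_name)  # True True p
print(old)     # Hello, Tutorialspoint!
print(result)  # <p>Online Learning!</p>
```

Converting with str() is a reasonable habit when the text is kept after parsing, since the plain copy does not hold a reference to the rest of the tree.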

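BeautifulSoup

The BeautifulSoup object is what we create when we set out to scrape a web resource, so it represents the complete document we are trying to scrape. For most purposes it can be treated as a Tag object.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("Hello, Tutorialspoint!", 'lxml')
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> soup.name
'[document]'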
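
The claim that the document object behaves like a tag can be checked directly. This is a small sketch with hypothetical markup and the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p>Hello, Tutorialspoint!</p>", "html.parser")

is_tag = isinstance(soup, Tag)  # BeautifulSoup is a subclass of Tag
doc_name = soup.name            # special name given to the document object
first_p = soup.find("p")        # tag-style searching works on the document

print(is_tag)          # True
print(doc_name)        # [document]
print(first_p.string)  # Hello, Tutorialspoint!
```

The special name "[document]" is a convenient way to tell the document object apart from a real tag when walking up through parents.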

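Comment

A Comment object represents the comment portion of a web document. It is just a special type of NavigableString.

>>> soup = BeautifulSoup('<p><!-- This is a comment --></p>', 'lxml')
>>> comment = soup.p.string
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(soup.p.prettify())
<p>
 <!-- This is a comment -->
</p>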
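
Since a Comment is a NavigableString subclass, comments can be picked out with an isinstance check. The sketch below shows one common way to collect and then strip them; the markup and variable names are hypothetical, and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: visible text followed by a comment.
html = "<p>Visible text<!-- hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the string nodes that are comments.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c).strip() for c in comments]
is_subclass = issubclass(Comment, NavigableString)

# Remove every comment from the tree.
for c in comments:
    c.extract()
without_comments = str(soup)

print(comment_texts)     # ['hidden note']
print(is_subclass)       # True
print(without_comments)  # <p>Visible text</p>
```

The text of a Comment excludes the <!-- and --> delimiters; they are added back only when the tree is serialized.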

NavigableString Objects

A NavigableString object represents the text inside a tag, rather than the tag itself.