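Find the text length of the first given tag using BeautifulSoup
In this article, we will use BeautifulSoup to find the length of the text inside the first occurrence of a given tag.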
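Let us look at an example. The code below parses an HTML document with 'html.parser' and computes the length of the text in the 'h2' tag. soup = BeautifulSoup(html_doc, 'html.parser') parses the whole HTML document with Python's built-in html.parser. soup.find('h2') accepts any valid HTML tag name and searches the document for the first tag with that name. If the tag exists, .text returns the text it contains and the remaining operations run. If the tag does not exist, find() returns None, and accessing .text on None raises an AttributeError.
In this example we care about the length, so we use the len() function. len() returns the number of items in an object; for a string, it returns the number of characters in that string.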
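The missing-tag case described above can be handled explicitly instead of letting the AttributeError surface. The following is a minimal sketch rather than part of the original walkthrough: the helper name first_tag_text_length and the sample markup are assumptions made for illustration. It relies only on find() returning None when nothing matches.

```python
from bs4 import BeautifulSoup


def first_tag_text_length(html, tag_name):
    """Return the text length of the first <tag_name>, or None if absent.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(tag_name)  # find() returns None when nothing matches
    if tag is None:
        return None
    return len(tag.text)


sample = "<h2>Hello</h2><p>World!</p>"  # assumed sample markup
print(first_tag_text_length(sample, "h2"))  # 5
print(first_tag_text_length(sample, "h3"))  # None
```

Returning None keeps the caller in control: it can skip the tag, log it, or raise its own error.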
Example 1:
In this example we fetch the text inside the first 'h2' tag and count the number of characters in that string.
Python3
# import module
from bs4 import BeautifulSoup
# assign HTML document
# NOTE: the tags below are a reconstruction (assumed layout: title, h2, p)
html_doc = """
<html>
<head>
<title>An example of HTML page to find the length of
the first tag</title>
</head>
<body>
<h2>An example of HTML page to find the length of the
first tag</h2>

<p>Beautiful Soup is a library which is essential to scrape
information from web pages.
It helps to iterate, search and modifying the parse tree.</p>

</body>
</html>
"""
# create BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')
# get length of the text inside the first h2 tag
print("Length of the text of the first tag:")
print(len(soup.find('h2').text))
Output:
Length of the text of the first tag:
59
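The soup.find().text expression retrieves the text enclosed by the given tag, and len() then returns the length of that text. The count includes every character of the text, including spaces and line breaks.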
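Because .text keeps whitespace, the reported length can be larger than the visible text suggests. The sketch below, which uses assumed sample markup, compares .text with get_text(strip=True), a BeautifulSoup call that strips leading and trailing whitespace from each piece of text.

```python
from bs4 import BeautifulSoup

html = "<h2>  Beautiful\nSoup  </h2>"  # assumed sample markup
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

raw_length = len(h2.text)                       # counts the padding and the newline
stripped_length = len(h2.get_text(strip=True))  # outer whitespace removed first

print(raw_length)       # 18
print(stripped_length)  # 14
```

Which figure is the right one depends on whether formatting whitespace should count as part of the tag's text.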
Example 2:
Get the text length of every HTML tag present in the given HTML.
Python3
# import module
from bs4 import BeautifulSoup
# assign HTML document
# NOTE: the tags below are a reconstruction (assumed layout: title, h2, p)
html_doc = """
<html>
<head>
<title>An example of HTML page to find the length of
the first tag</title>
</head>
<body>
<h2>An example of HTML page to find the length of the
first tag</h2>

<p>Beautiful Soup is a library which is essential to scrape
information from web pages.
It helps to iterate, search and modifying the parse tree.</p>

</body>
</html>
"""
# create BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')
# get all the tags present in the HTML and print the length of their text
# (findAll is the legacy spelling of find_all; True matches any tag)
for tag in soup.findAll(True):
    print(tag.name, " : ", len(soup.find(tag.name).text))
Output:
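findAll(True) returns every tag in the document, because True matches any tag name. The loop for tag in soup.findAll(True): iterates over all the tags found, and print(tag.name, " : ", len(soup.find(tag.name).text)) displays each tag name with the length of its text, one per line. Note that soup.find(tag.name) always measures the first tag with that name, so repeated tags report the same length.
If we explicitly want only the first tag, we place a break statement after the print statement in the code above.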
Python3
# get length of first tag only
for tag in soup.findAll(True):
    print(tag.name, " : ", len(soup.find(tag.name).text))
    break
Output:
html : 270
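The loop-and-break pattern can also be written without a loop. As a minimal sketch with assumed sample markup: find() accepts True just as findAll() does and returns only the first match, so it yields the first tag of the document directly.

```python
from bs4 import BeautifulSoup

html = "<div><p>Hi</p><p>there</p></div>"  # assumed sample markup
soup = BeautifulSoup(html, "html.parser")

# True matches any tag name; find() stops at the first match
first = soup.find(True)
print(first.name, " : ", len(first.text))
```

Here the first tag is the outer div, and its text is the concatenation of the text of both paragraphs.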
Example 3:
In this example, we find the text length of one specific tag in the HTML document.
Python3
# import module
from bs4 import BeautifulSoup
# assign HTML document
# NOTE: the tags below are a reconstruction (assumed layout: title, h2, p)
html_doc = """
<html>
<head>
<title>An example of HTML page to find the length of
the first tag</title>
</head>
<body>
<h2>An example of HTML page to find the length of the
first tag</h2>

<p>Beautiful Soup is a library which is essential to scrape
information from web pages.
It helps to iterate, search and modifying the parse tree.</p>

</body>
</html>
"""
# create BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')
# assign tag
tag = "html"
# get length (uses the soup object created above)
print("Length of the text of", tag, "tag is:",
      len(soup.find(tag).text))
Output:
Length of the text of html tag is: 270
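Example 4:
Now let us see how to get a tag and the length of its text from a live web page, such as a site like Monster. Because the data has to be fetched from a URL, we import the requests module to download the page first.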
Python3
# import modules
from bs4 import BeautifulSoup
import requests
# assign URL
monsterPageURL = 'https://www.geeksforgeeks.org/how-to-scrape-all-pdf-files-in-a-website/'
# download the page
monsterPage = requests.get(monsterPageURL)
# create BeautifulSoup object from the response body
soupResults = BeautifulSoup(monsterPage.content, 'html.parser')
# assign tag
tag = "title"
# get length of the text of the tag
print("Length of the text of", tag, "tag is:",
      len(soupResults.find(tag).text))
Output:
Length of the text of title tag is: 57
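A live page can be slow, unavailable, or missing the tag, so the exact number printed may also change whenever the page changes. The sketch below shows one way to make the request more defensive. The helper names tag_text_length and fetch_tag_text_length and the 10-second timeout are assumptions for illustration; timeout and raise_for_status() are standard parts of requests.

```python
from bs4 import BeautifulSoup
import requests


def tag_text_length(html, tag_name):
    """Length of the first <tag_name>'s text, or None if missing (hypothetical helper)."""
    tag = BeautifulSoup(html, "html.parser").find(tag_name)
    return None if tag is None else len(tag.text)


def fetch_tag_text_length(url, tag_name, timeout=10):
    """Download url and measure the first <tag_name> (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # do not wait forever
    response.raise_for_status()                    # raise on 4xx/5xx responses
    return tag_text_length(response.content, tag_name)


# Offline check with assumed markup; no network access is needed here.
demo = b"<html><head><title>Demo page</title></head></html>"
print(tag_text_length(demo, "title"))  # 9
```

To measure a real page, call fetch_tag_text_length with the page URL and the tag name, for example "title".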