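Extracting the HTML Code of a Given Tag and Its Parent Using BeautifulSoup
In this article, we discuss how to use BeautifulSoup to extract the HTML code of a given tag and of its parent tag.
Modules needed
First, install the following modules on your computer.
- BeautifulSoup: the main module. It parses HTML and XML documents and provides methods for searching and navigating the parsed tree. It does not download pages itself.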
pip install bs4
- lxml: a library for processing XML and HTML in Python. BeautifulSoup uses it here as the underlying parser.
pip install lxml
- Requests: a library that makes sending HTTP requests straightforward. We use it to download the page.
pip install requests
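Scraping a sample website
- We import the BeautifulSoup class and the requests module, declare a headers dictionary with a User-Agent, and send it with the request. Identifying the client this way makes it less likely that the target website treats the traffic from our program as spam and blocks it. The User-Agent string used below is only a placeholder.
Python3
# importing the modules
from bs4 import BeautifulSoup
import requests
# URL to be scraped
URL = "https://en.wikipedia.org/wiki/Machine_learning"
# headers sent with the request; the User-Agent value is a placeholder
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/1.0)"}
# getting the contents of the website and parsing them
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")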
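- Now locate the element you want information from in your browser, right-click it, and choose Inspect Element. In the inspector, look for an HTML attribute that no other element shares. Most of the time this is the element's id.
Here, to extract the HTML of the page title, we can simply look it up by the title's id.
Python3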
# getting the h1 with id as firstHeading and printing it
title = soup.find("h1", attrs={"id": 'firstHeading'})
print(title)
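The id lookup can also be tried without a network connection. The following is a minimal sketch: the HTML fragment is made up for illustration, and the built-in "html.parser" is used instead of lxml only so that the example has no extra dependency. It shows three ways of selecting the same element and what find() returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <h1 id="firstHeading" class="title">Machine learning</h1>
  <h1 class="other">Another heading</h1>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Three ways to look up the same element by its id.
by_attrs = soup.find("h1", attrs={"id": "firstHeading"})
by_keyword = soup.find("h1", id="firstHeading")
by_css = soup.select_one("h1#firstHeading")

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("h1", attrs={"id": "noSuchId"})

print(by_attrs)
print(by_attrs == by_keyword == by_css)
print(missing)
```

Checking for None before using the result avoids an AttributeError when a page changes its markup.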
- To extract the text content of that tag, we can simply call its .get_text() method, as follows:
Python3
# getting the text/content inside the h1 tag we
# parsed on the previous line
cont = title.get_text()
print(cont)
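When a tag contains nested tags or stray whitespace, the optional separator and strip arguments of get_text() may be useful. The sketch below assumes a made-up heading with nested markup and again uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical heading with nested tags and padding whitespace.
html = '<h1 id="firstHeading">  <span>Machine</span> <i>learning</i>  </h1>'
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", attrs={"id": "firstHeading"})

# Default call: every text node is concatenated exactly as it appears.
raw = title.get_text()

# strip=True trims each text node and drops empty ones;
# separator is placed between the remaining pieces.
clean = title.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```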
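- To extract the HTML of an element's parent, let us take the span with the id "Machine_learning_approaches" as an example.
We find the span and read its .parent attribute, which is the tag that directly encloses it. Printing that tag displays its HTML. Note that .parent is an attribute, not a method: writing .parent() calls the parent tag like a function, which is shorthand for find_all() and returns a list of the parent's descendants rather than the parent itself.
Python3
# getting the span with id "Machine_learning_approaches"
span = soup.find("span",
                 attrs={"id": 'Machine_learning_approaches'})
# find() returns None if the page has no such span
parent = span.parent if span is not None else None
# printing the HTML of the span's parent
print(parent)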
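The difference between the .parent attribute and related navigation helpers can be checked on a small document. This is a minimal sketch under stated assumptions: the HTML fragment is invented for illustration and only mimics a heading that wraps a span, and "html.parser" is used so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a span wrapped in a heading inside a div.
html = """
<div id="content">
  <h2><span id="Machine_learning_approaches">Approaches</span></h2>
  <p>Some text.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", attrs={"id": "Machine_learning_approaches"})

# .parent is an attribute holding the directly enclosing tag.
parent_tag = span.parent
as_html = str(parent_tag)

# Calling a tag is shorthand for find_all(), so this returns the
# descendants of the parent, not the parent itself.
descendants = span.parent()

# find_parent() walks upward to the nearest ancestor with the given name.
nearest_div = span.find_parent("div")

# .parents iterates over every ancestor, up to the document object.
ancestors = [tag.name for tag in span.parents]

print(as_html)
print(descendants)
print(nearest_div["id"])
print(ancestors)
```

str() on a tag gives its HTML on one line as parsed, while prettify() would return an indented version of the same markup.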
Below is the complete program:
Python3
# importing the modules
from bs4 import BeautifulSoup
import requests
# URL to be scraped
URL = "https://en.wikipedia.org/wiki/Machine_learning"
# headers sent with the request; the User-Agent value is a placeholder
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/1.0)"}
# getting the contents of the website and parsing them
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")
# getting the h1 with id as firstHeading and printing it
title = soup.find("h1", attrs={"id": 'firstHeading'})
print(title)
# getting the text/content inside the h1 tag we
# parsed on the previous line
cont = title.get_text()
print(cont)
# getting the span with id "Machine_learning_approaches"
span = soup.find("span",
                 attrs={"id": 'Machine_learning_approaches'})
# find() returns None if the page has no such span
parent = span.parent if span is not None else None
# printing the HTML of the span's parent
print(parent)
Output: