How to Write the Output to an HTML File with Python BeautifulSoup?
In this article, we will write the output to an HTML file using Python BeautifulSoup. BeautifulSoup is a Python library used primarily for web scraping, but here we will focus on how to save the parsed output to an HTML file.
Modules needed and installation:
pip install bs4
The code below also uses the requests library to fetch the page, so install it as well:
pip install requests
Approach:
- First, import all the required libraries.
- Make a GET request to the desired URL and extract its page content.
- Write the output to a new file using Python's built-in file handling.
Steps to be followed:
Step 1: Import the required libraries.
Python3
# Import libraries
from bs4 import BeautifulSoup
import requests
Step 2: Make a GET request to the target URL (here, a GeeksforGeeks article page), extract its page content, and create a soup object by passing the content to BeautifulSoup with the markup parser set to html.parser.
Note: If you are extracting an XML page, set the parser feature to "xml" instead (this requires the lxml package); see the sketch after the code block below.
Python3
# set the url to perform the get request
URL = 'https://www.geeksforgeeks.org/how-to-scrape-all-pdf-files-in-a-website/'
page = requests.get(URL)
# load the page content
text = page.content
# make a soup object by using beautiful
# soup and set the markup as html parser
soup = BeautifulSoup(text, "html.parser")
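As mentioned in the note above, parsing an XML page only changes the parser passed to BeautifulSoup. A minimal sketch, assuming the lxml package is installed (pip install lxml) and using a placeholder feed URL:
Python3
# Import libraries
from bs4 import BeautifulSoup
import requests

# placeholder XML/RSS feed URL - replace with the feed you actually want
XML_URL = 'https://www.geeksforgeeks.org/feed/'
page = requests.get(XML_URL)

# parse the response as XML by setting the feature to "xml"
# (the "xml" feature requires the lxml package)
soup = BeautifulSoup(page.content, "xml")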
Step 3: Using Python file handling, write the soup object to the output file. We open the file in write mode with the encoding set to UTF-8, and call .prettify() on the soup object, which returns the markup as an indented, easier-to-read string that can be written directly to the file.
The output file is saved as output.html in the same directory.
Python3
# open the file in write mode
# and set encoding to UTF-8
with open("output.html", "w", encoding='utf-8') as file:
    # prettify() returns the markup as an indented string
    file.write(soup.prettify())
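If only part of the page is needed, a single tag can be prettified and written instead of the whole soup. A minimal sketch, assuming the soup object built in Step 2 and writing just the page's <title> tag to a hypothetical title.html file:
Python3
# write only the <title> tag instead of the whole document
with open("title.html", "w", encoding='utf-8') as file:
    # Tag objects also provide prettify()
    file.write(soup.title.prettify())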
Below is the complete implementation:
Python3
# Import libraries
from bs4 import BeautifulSoup
import requests
# set the url to perform the get request
URL = 'https://www.geeksforgeeks.org/how-to-scrape-all-pdf-files-in-a-website/'
page = requests.get(URL)
# load the page content
text = page.content
# make a soup object by using
# beautiful soup and set the markup as html parser
soup = BeautifulSoup(text, "html.parser")
# open the file in write mode
# and set encoding to UTF-8
with open("output.html", "w", encoding='utf-8') as file:
    # prettify() returns the markup as an indented string
    file.write(soup.prettify())
Output:
The prettified HTML of the requested page is saved to output.html in the working directory.
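To quickly confirm that the file was written correctly, you can re-open output.html and parse it again. A minimal sketch, assuming the complete implementation above has already been run in the same directory:
Python3
# re-open the saved file and parse it to confirm it contains valid HTML
with open("output.html", "r", encoding='utf-8') as file:
    saved = BeautifulSoup(file.read(), "html.parser")

# print the page title as a quick sanity check
print(saved.title.string if saved.title else "no <title> tag found")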