How to scrape data from local HTML files using Python?
Python's BeautifulSoup module allows us to scrape data from locally stored HTML files. For one reason or another, web pages may be saved locally (in an offline environment), and data may later need to be extracted from them; sometimes data also has to be pulled from several locally stored HTML files at once. An HTML file is usually made up of nested tags such as <html>, <head>, <body>, <title>, <p>, and so on.
Installation
BeautifulSoup can be installed by typing the following command in the terminal.
pip install beautifulsoup4
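After installing, a quick way to confirm that the module imports and parses correctly is to feed it a one-line document (a minimal sanity check, not part of the examples that follow; note that the package installs as beautifulsoup4 but imports as bs4):

```python
# bs4 is the import name for the
# beautifulsoup4 package
from bs4 import BeautifulSoup

# Parse a one-line document and read the
# text back out of the <p> tag
soup = BeautifulSoup("<p>Hello, BeautifulSoup</p>", "html.parser")
text = soup.p.text
print(text)
```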
Getting Started
If an HTML file is stored at some location and we need to scrape its contents through Python using BeautifulSoup, lxml is a good choice of parser, since it is built for parsing XML and HTML and supports both one-step and incremental parsing.
The prettify() method in BeautifulSoup helps to view the tags and how they are nested.
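As a quick illustration of what prettify() does, here is a sketch on a small inline string (the stdlib html.parser is used here so the snippet runs even without lxml installed):

```python
from bs4 import BeautifulSoup

# A deliberately one-line, unindented fragment
html = "<html><body><p>nested</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the markup with one tag
# per line, indented to reflect the nesting
pretty = soup.prettify()
print(pretty)
```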
Example: first, let's create a sample HTML file.
Python3
# Necessary import
import urllib.request

# As an example, my published-article list
# page is fetched and stored locally
with urllib.request.urlopen('https://auth.geeksforgeeks.org/user/priyarajtt/articles') as webPageResponse:
    outputHtml = webPageResponse.read()

# The fetched contents are placed in
# samplehtml.html and reused by the next
# set of examples. read() returns bytes,
# so the file is opened in binary mode
with open('samplehtml.html', 'wb') as f:
    f.write(outputHtml)
Output:
Now, use the prettify() method to view the tags and content in an easier way.
Python3
# BeautifulSoup is provided by
# the bs4 module
from bs4 import BeautifulSoup

# Open the HTML file; if the file is
# stored in a different location, the
# exact path needs to be given
with open("samplehtml.html", "r") as HTMLFileToBeOpened:
    # Read the file and store it in a variable
    contents = HTMLFileToBeOpened.read()

# Create a BeautifulSoup object and
# specify the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# prettify() prints each tag on its own
# line, indented to show the nesting
print(beautifulSoupText.body.prettify())
Output:
In this way we can get the HTML data. Now let's run a few operations on it to get some insight into the data.
Example 1:
We can use the find() method, but since the HTML content changes dynamically we may not know the exact tag names in advance. In that case we can first use find_all(True) to get every tag, and then perform any kind of operation on them, for example getting each tag's name and the length of its text.
Python3
# BeautifulSoup is provided by
# the bs4 module
from bs4 import BeautifulSoup

# Open the HTML file; if the file is
# stored in a different location, the
# exact path needs to be given
with open("samplehtml.html", "r") as HTMLFileToBeOpened:
    # Read the file and store it in a variable
    contents = HTMLFileToBeOpened.read()

# Create a BeautifulSoup object and
# specify the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# Get every tag present in the HTML and
# print its name and the length of its text
for tag in beautifulSoupText.find_all(True):
    print(tag.name, " : ", len(tag.text))
Output:
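Besides matching every tag with find_all(True), find_all() can also target a specific tag and attribute. A short sketch, using a small inline document in place of the samplehtml.html file above, that collects every link on a page:

```python
from bs4 import BeautifulSoup

# A small inline document stands in for the
# locally stored file used above
html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <p>Not a link</p>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('a', href=True) matches only <a>
# tags that carry an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```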
Example 2:
Now, instead of scraping one HTML file, we want to process every HTML file present in a directory (this can be necessary when, for example, a particular directory is filled with downloaded pages every day and has to be scraped as a batch).
We can use the functions of the 'os' module. Let's take all the HTML files present in the current directory as our example.
So our task is to get every HTML file scraped. It can be achieved in the following way: the HTML files of the whole folder are scraped one by one, and the tag lengths of all the files are retrieved.
Python3
# Necessary import for getting the
# directory and filenames
import os

from bs4 import BeautifulSoup

# Get the current working directory
directory = os.getcwd()

# For every file present in that directory
for filename in os.listdir(directory):
    # Check whether the file has the .html
    # extension, using endswith()
    if filename.endswith('.html'):
        # os.path.join() joins one or more path
        # components, giving the exact file path
        fname = os.path.join(directory, filename)
        print("Current file name ..", os.path.abspath(fname))
        # Open the file
        with open(fname, 'r') as file:
            beautifulSoupText = BeautifulSoup(file.read(), 'html.parser')
            # Parse the HTML as you wish
            for tag in beautifulSoupText.find_all(True):
                print(tag.name, " : ", len(tag.text))
Output:
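The os.listdir() walk above can also be written with pathlib, which folds the extension filter and path joining into one glob() call. A sketch, using a throwaway temporary directory so it is self-contained (in the article's setting, directory would simply be the current working directory):

```python
import tempfile
from pathlib import Path

# A throwaway directory stands in for the
# batch directory being scraped
directory = Path(tempfile.mkdtemp())
(directory / "a.html").write_text("<html><body><p>one</p></body></html>")
(directory / "b.html").write_text("<html><body><p>two</p></body></html>")
(directory / "notes.txt").write_text("not html")

# glob('*.html') replaces the os.listdir()
# loop plus the endswith() check in one step
html_files = sorted(p.name for p in directory.glob("*.html"))
print(html_files)
```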