How to scrape data from local HTML files using Python?
Python's BeautifulSoup module allows us to scrape data from locally stored HTML files. For one reason or another, web pages may be saved locally (in an offline environment), and data may later need to be extracted from them; sometimes data also has to be pulled from several locally stored HTML files at once. An HTML file is usually made up of nested tags such as <html>, <head>, <body>, <title>, <p>, and so on.
Installation
BeautifulSoup can be installed by typing the following command in the terminal.
pip install beautifulsoup4
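After installing, a quick way to confirm that the module imports and parses correctly is to feed it a one-line document (a minimal sanity check, not part of the examples that follow; note that the package installs as beautifulsoup4 but imports as bs4):

```python
# bs4 is the import name for the
# beautifulsoup4 package
from bs4 import BeautifulSoup

# Parse a one-line document and read the
# text back out of the <p> tag
soup = BeautifulSoup("<p>Hello, BeautifulSoup</p>", "html.parser")
text = soup.p.text
print(text)
```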
Getting Started
If an HTML file is stored at some location and we need to scrape its contents through Python using BeautifulSoup, lxml is a good choice of parser, since it is built for parsing XML and HTML and supports both one-step and incremental parsing.
The prettify() method in BeautifulSoup helps to view the tags and how they are nested.
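As a quick illustration of what prettify() does, here is a sketch on a small inline string (the stdlib html.parser is used here so the snippet runs even without lxml installed):

```python
from bs4 import BeautifulSoup

# A deliberately one-line, unindented fragment
html = "<html><body><p>nested</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the markup with one tag
# per line, indented to reflect the nesting
pretty = soup.prettify()
print(pretty)
```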
Example: first, let's create a sample HTML file.
Python3
# Necessary import
import urllib.request

# As an example, my published-article list
# page is fetched and stored locally
with urllib.request.urlopen('https://auth.geeksforgeeks.org/user/priyarajtt/articles') as webPageResponse:
    outputHtml = webPageResponse.read()

# The fetched contents are placed in
# samplehtml.html and reused by the next
# set of examples. read() returns bytes,
# so the file is opened in binary mode
with open('samplehtml.html', 'wb') as f:
    f.write(outputHtml)
Output:
Now, use the prettify() method to view the tags and content in an easier way.
Python3
# BeautifulSoup is provided by
# the bs4 module
from bs4 import BeautifulSoup

# Open the HTML file; if the file is
# stored in a different location, the
# exact path needs to be given
with open("samplehtml.html", "r") as HTMLFileToBeOpened:
    # Read the file and store it in a variable
    contents = HTMLFileToBeOpened.read()

# Create a BeautifulSoup object and
# specify the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# prettify() prints each tag on its own
# line, indented to show the nesting
print(beautifulSoupText.body.prettify())
Output:
In this way we can get the HTML data. Now let's run a few operations on it to get some insight into the data.
Example 1:
We can use the find() method, but since the HTML content changes dynamically we may not know the exact tag names in advance. In that case we can first use find_all(True) to get every tag, and then perform any kind of operation on them, for example getting each tag's name and the length of its text.
Python3
# BeautifulSoup is provided by
# the bs4 module
from bs4 import BeautifulSoup

# Open the HTML file; if the file is
# stored in a different location, the
# exact path needs to be given
with open("samplehtml.html", "r") as HTMLFileToBeOpened:
    # Read the file and store it in a variable
    contents = HTMLFileToBeOpened.read()

# Create a BeautifulSoup object and
# specify the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# Get every tag present in the HTML and
# print its name and the length of its text
for tag in beautifulSoupText.find_all(True):
    print(tag.name, " : ", len(tag.text))
Output:
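Besides matching every tag with find_all(True), find_all() can also target a specific tag and attribute. A short sketch, using a small inline document in place of the samplehtml.html file above, that collects every link on a page:

```python
from bs4 import BeautifulSoup

# A small inline document stands in for the
# locally stored file used above
html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <p>Not a link</p>
  <a href="https://example.com/b">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('a', href=True) matches only <a>
# tags that carry an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```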
Example 2:
Now, instead of scraping one HTML file, we want to process every HTML file present in a directory (this can be necessary when, for example, a particular directory is filled with downloaded pages every day and has to be scraped as a batch).
We can use the functions of the 'os' module. Let's take all the HTML files present in the current directory as our example.
So our task is to get every HTML file scraped. It can be achieved in the following way: the HTML files of the whole folder are scraped one by one, and the tag lengths of all the files are retrieved.
Python3
# Necessary import for getting the
# directory and filenames
import os

from bs4 import BeautifulSoup

# Get the current working directory
directory = os.getcwd()

# For every file present in that directory
for filename in os.listdir(directory):
    # Check whether the file has the .html
    # extension, using endswith()
    if filename.endswith('.html'):
        # os.path.join() joins one or more path
        # components, giving the exact file path
        fname = os.path.join(directory, filename)
        print("Current file name ..", os.path.abspath(fname))
        # Open the file
        with open(fname, 'r') as file:
            beautifulSoupText = BeautifulSoup(file.read(), 'html.parser')
            # Parse the HTML as you wish
            for tag in beautifulSoupText.find_all(True):
                print(tag.name, " : ", len(tag.text))
Output:
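The os.listdir() walk above can also be written with pathlib, which folds the extension filter and path joining into one glob() call. A sketch, using a throwaway temporary directory so it is self-contained (in the article's setting, directory would simply be the current working directory):

```python
import tempfile
from pathlib import Path

# A throwaway directory stands in for the
# batch directory being scraped
directory = Path(tempfile.mkdtemp())
(directory / "a.html").write_text("<html><body><p>one</p></body></html>")
(directory / "b.html").write_text("<html><body><p>two</p></body></html>")
(directory / "notes.txt").write_text("not html")

# glob('*.html') replaces the os.listdir()
# loop plus the endswith() check in one step
html_files = sorted(p.name for p in directory.glob("*.html"))
print(html_files)
```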