
📅  Last modified: 2022-05-13 01:54:43.449000             🧑  Author: Mango

How to Scrape Data from Local HTML Files Using Python?

Python's BeautifulSoup module allows us to scrape data from local HTML files. For various reasons, website pages may be stored locally (in an offline environment), and data may need to be extracted from them when required. Sometimes data may also have to be fetched from multiple locally stored HTML files. An HTML file usually consists of nested tags, and using BeautifulSoup we can scrape the content and get the necessary details.

Installation

It can be installed by typing the following command in the terminal.

pip install beautifulsoup4

Getting Started

If an HTML file is stored in one location and we need to scrape its contents via Python using BeautifulSoup, lxml is a great API, as it is meant for parsing XML and HTML. It supports both one-step parsing and step-by-step parsing.

The prettify() function in BeautifulSoup helps to view the tag structure and its nesting.
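As a quick, self-contained illustration (the inline snippet below is hypothetical, not part of the article's sample file), prettify() re-indents the parse tree so the nesting becomes visible:

```python
from bs4 import BeautifulSoup

# A tiny hypothetical snippet, purely for illustration
html = "<html><body><div><p>Hello</p></div></body></html>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns the markup with each tag
# on its own line, indented by nesting depth
print(soup.prettify())
```

Each tag lands on its own line, indented one level deeper than its parent.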

Example: Let's create a sample HTML file.



Python3
# Necessary import
import urllib.request

# As an example, the author's article-list
# page is downloaded and stored locally
with urllib.request.urlopen('https://auth.geeksforgeeks.org/user/priyarajtt/articles') as webPageResponse:
    outputHtml = webPageResponse.read()

# The downloaded contents are placed in the
# samplehtml.html file and used for the next
# set of examples; the response is bytes, so
# the file is opened in binary mode
with open('samplehtml.html', 'wb') as f:
    f.write(outputHtml)



Output:

Now, use the prettify() method to view the tags and contents in a more readable way.

Python3

# Importing BeautifulSoup, which
# lives in the bs4 module
from bs4 import BeautifulSoup

# Opening the HTML file. If the file is
# present in a different location, the
# exact path needs to be mentioned
with open("samplehtml.html", "r") as HTMLFileToBeOpened:

    # Reading the file and storing it in a variable
    contents = HTMLFileToBeOpened.read()

# Creating a BeautifulSoup object and
# specifying the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# The prettify() method prints the tags
# and their nesting with indentation
print(beautifulSoupText.body.prettify())

Output:

In this way, we can get the HTML data. Now let's perform some operations and derive some insights from the data.

Example 1:



We could use the find() method, but as the HTML content changes dynamically, we may not know the exact tag names. In that case, we can first use findAll(True) to get the tag names, and then perform any kind of operation on them, for example getting each tag's name and the length of its text.
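To make the contrast concrete, here is a minimal sketch on a hypothetical inline snippet (not the scraped file): find() returns only the first matching tag, while findAll() returns every match:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for samplehtml.html
html = "<html><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first <p>
print(soup.find("p").text)                  # First

# findAll() collects every <p>
print([p.text for p in soup.findAll("p")])  # ['First', 'Second']
```

So find() is handy when the tag name is known in advance, and findAll(True) when it is not.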

Python3

# Importing BeautifulSoup, which
# lives in the bs4 module
from bs4 import BeautifulSoup

# Opening the HTML file. If the file is
# present in a different location, the
# exact path needs to be mentioned
with open("samplehtml.html", "r") as HTMLFileToBeOpened:

    # Reading the file and storing it in a variable
    contents = HTMLFileToBeOpened.read()

# Creating a BeautifulSoup object and
# specifying the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# findAll(True) yields every tag in the
# document; for each tag, print its name
# and the length of its text
for tag in beautifulSoupText.findAll(True):
    print(tag.name, " : ", len(tag.text))

Output:

Example 2:

Now, instead of scraping one HTML file, we want to process all the HTML files present in a directory (this may be necessary when, for example, a particular directory is filled with downloaded data every day and has to be scraped as a batch).

We can use the functionality of the 'os' module for this. Let's take all the HTML files in the current directory as our example.

So our task is to get all the HTML files scraped. We can achieve it in the following way: the HTML files of the whole folder are scraped one by one, and the tag lengths of every file are retrieved.

Python3

# Necessary imports for getting the
# directory listing and filenames
import os

from bs4 import BeautifulSoup

# Get the current working directory
directory = os.getcwd()

# For every file present in that directory
for filename in os.listdir(directory):

    # Check whether the file has the
    # .html extension, which can be done
    # with the endswith() function
    if filename.endswith('.html'):

        # os.path.join() joins one or more
        # path components, which helps to
        # address the file exactly
        fname = os.path.join(directory, filename)
        print("Current file name ..", os.path.abspath(fname))

        # Open the file
        with open(fname, 'r') as file:

            beautifulSoupText = BeautifulSoup(file.read(), 'html.parser')

            # Parse the HTML as you wish: here,
            # each tag's name and text length
            for tag in beautifulSoupText.findAll(True):
                print(tag.name, " : ", len(tag.text))

Output:
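As an aside, the listdir/endswith filtering used above can also be expressed with the standard-library glob module; this is an equivalent sketch, not the article's approach:

```python
import glob
import os

# Collect the absolute paths of all .html
# files in the current working directory
html_files = [os.path.abspath(p)
              for p in glob.glob(os.path.join(os.getcwd(), "*.html"))]
print(html_files)
```

glob handles the pattern matching and the path joining in one step, which keeps the loop body focused on the parsing.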