Python | Reading the contents of a PDF using OCR (Optical Character Recognition)

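Python is widely used for analyzing data, but the data is not always in the required format. In such cases, we convert the source format (PDF, JPG, etc.) to text so that the data can be analyzed more easily. Python offers many libraries for this task.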

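There are several ways of doing this, including using libraries such as PyPDF2 in Python. The major disadvantage of these libraries is their dependence on the encoding scheme. PDF documents can use a variety of text encodings (for example ASCII, or Unicode encodings such as UTF-8), so converting a PDF directly to text may lose data when the encoding is not handled correctly.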

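Let's see how to read all the contents of a PDF file and store them in a text document using OCR.

First, we convert the pages of the PDF to images. Then we use OCR (Optical Character Recognition) to read the content from each image and store it in a text file.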

Required installations:

pip3 install Pillow
pip3 install pytesseract
pip3 install pdf2image
sudo apt-get install tesseract-ocr
sudo apt-get install poppler-utils
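
Before running the program, it can help to confirm that the external command-line tools are reachable. The following is a minimal sketch that uses only the standard library. The helper name missing_tools is hypothetical, and the sketch assumes that pytesseract calls the tesseract binary and that pdf2image relies on Poppler's pdftoppm (provided on Debian/Ubuntu by the poppler-utils package).

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

An empty list means both tools were located. Any name that is reported probably needs to be installed or added to PATH first.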

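The program has two parts.

Part 1 deals with converting the PDF into image files. Each page of the PDF is stored as one image file. The image files are named:
PDF page 1 -> page_1.jpg
PDF page 2 -> page_2.jpg
PDF page 3 -> page_3.jpg
....
PDF page n -> page_n.jpg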
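
The naming scheme above can be sketched as a small helper. The function name page_image_name and its prefix and extension parameters are hypothetical. They only illustrate how a 1-based page number maps to a filename.

```python
def page_image_name(page_number, prefix="page_", extension="jpg"):
    """Build the image filename for a 1-based PDF page number."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"{prefix}{page_number}.{extension}"


names = [page_image_name(n) for n in range(1, 4)]
print(names)  # ['page_1.jpg', 'page_2.jpg', 'page_3.jpg']
```

Keeping the scheme in one place means Part 1 (saving) and Part 2 (reading) cannot drift apart if the prefix or the image format changes.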

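Part 2 deals with recognizing text from the image files and storing it in a text file. Here, we process the images and convert them into text. Once we have the text as a string variable, we can apply any processing to it. For example, in many PDFs, when a line ends but a word does not fit completely on that line, a hyphen ('-') is added and the word continues on the next line. For example: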

This is some sample text but this parti-
cular word could not be written in the same line.

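For such words, a basic pre-processing step removes the hyphen and the newline so that the word is whole again. After all pre-processing is done, the text is stored in a separate text file.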
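
As an illustration of this pre-processing step, here is a minimal sketch. It is a slightly stricter variation on a plain replacement of '-\n': it joins the two halves only when a letter precedes the hyphen and a lowercase letter starts the next line. The function name join_hyphenated_words is hypothetical.

```python
import re


def join_hyphenated_words(text):
    """Remove a hyphen + newline that splits one word across two lines."""
    # Join only when a letter comes before the hyphen and a lowercase
    # letter starts the next line, so "pages 1-\n5" is left untouched.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


sample = (
    "This is some sample text but this parti-\n"
    "cular word could not be written in the same line."
)
print(join_hyphenated_words(sample))
```

Note that a genuine compound word broken at its hyphen (for example "well-" followed by "known") would also be merged, so this heuristic trades a few such cases for simplicity.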

The input PDF file used in the code is d.pdf.

Below is the implementation:

# Import libraries
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
  
# Path of the pdf
PDF_file = "d.pdf"
  
'''
Part #1 : Converting PDF to images
'''
  
# Store all the pages of the PDF in a variable
pages = convert_from_path(PDF_file, 500)
  
# Counter to store images of each page of PDF to image
image_counter = 1
  
# Iterate through all the pages stored above
for page in pages:
  
    # Declaring filename for each page of PDF as JPG
    # For each page, filename will be:
    # PDF page 1 -> page_1.jpg
    # PDF page 2 -> page_2.jpg
    # PDF page 3 -> page_3.jpg
    # ....
    # PDF page n -> page_n.jpg
    filename = "page_"+str(image_counter)+".jpg"
      
    # Save the image of the page in system
    page.save(filename, 'JPEG')
  
    # Increment the counter to update filename
    image_counter = image_counter + 1
  
'''
Part #2 - Recognizing text from the images using OCR
'''
# Variable to get count of total number of pages
filelimit = image_counter-1
  
# Creating a text file to write the output
outfile = "out_text.txt"
  
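# Open the file in write mode so that repeated runs do not
# duplicate the output; the file stays open for the whole loop,
# so the contents of all images are added to the same file
f = open(outfile, "w", encoding="utf-8")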
  
# Iterate from 1 to total number of pages
for i in range(1, filelimit + 1):
  
    # Set filename to recognize text from
    # Again, these files will be:
    # page_1.jpg
    # page_2.jpg
    # ....
    # page_n.jpg
    filename = "page_"+str(i)+".jpg"
          
    # Recognize the text in the image as a string using pytesseract
    text = pytesseract.image_to_string(Image.open(filename))
  
    # The recognized text is stored in variable text
    # Any string processing may be applied on text
    # Here, basic formatting has been done:
    # In many PDFs, at line ending, if a word can't
    # be written fully, a 'hyphen' is added.
    # The rest of the word is written in the next line
    # Eg: This is a sample text this word here GeeksF-
    # orGeeks is half on first line, remaining on next.
    # To remove this, we replace every '-\n' to ''.
    text = text.replace('-\n', '')    
  
    # Finally, write the processed text to the file.
    f.write(text)
  
# Close the file after writing all the text.
f.close()

Output:

Input PDF file: [image]

Output text file: [image]

As we can see, the pages of the PDF were converted to images. The images were then read, and their content was written to a text file.

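Advantages of this method include:

It avoids text-based conversion, which can lose data because of the encoding scheme.
Because OCR is used, even handwritten content in a PDF can be recognized.
Recognizing only particular pages of the PDF is also possible.
The text is obtained as a variable, so any required pre-processing can be applied to it.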
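
To illustrate recognizing only particular pages, here is a hedged sketch that relies on the first_page and last_page parameters of pdf2image's convert_from_path. The function name ocr_pages is hypothetical, as are its convert and recognize parameters, which exist only so the logic can be tried without the OCR stack installed. By default they fall back to pdf2image and pytesseract.

```python
def ocr_pages(pdf_path, first_page, last_page, dpi=300,
              convert=None, recognize=None):
    """OCR pages first_page..last_page (1-based, inclusive).

    Returns a dict that maps each page number to its recognized text.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Import lazily so the real libraries are needed only when used.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if recognize is None:
        from pytesseract import image_to_string as recognize
    images = convert(pdf_path, dpi=dpi,
                     first_page=first_page, last_page=last_page)
    return {first_page + offset: recognize(image)
            for offset, image in enumerate(images)}
```

For example, ocr_pages("d.pdf", 2, 3) would render and recognize only pages 2 and 3, which avoids the cost of converting the whole document.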

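Disadvantages of this method include:

Disk storage is used to store the images on the local system, although the individual images are fairly small.
OCR does not guarantee 100% accuracy, although a computer-typed PDF document generally yields very high accuracy.
Handwritten PDFs can still be recognized, but the accuracy depends on various factors such as the handwriting, the page color, etc.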
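
The disk-storage drawback can be sidestepped, since convert_from_path returns PIL images and pytesseract.image_to_string accepts such images directly. The following is a minimal sketch of that variation, not the implementation shown earlier. The function name pdf_to_text and its convert and recognize parameters are hypothetical, and the default of 300 DPI is an assumption.

```python
def pdf_to_text(pdf_path, dpi=300, convert=None, recognize=None):
    """OCR every page of a PDF in memory and return the combined text."""
    # Import lazily so the real libraries are needed only when used.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if recognize is None:
        from pytesseract import image_to_string as recognize
    page_texts = []
    for page_image in convert(pdf_path, dpi=dpi):
        text = recognize(page_image)
        # Same basic clean-up as before: rejoin words split by '-\n'.
        page_texts.append(text.replace("-\n", ""))
    return "".join(page_texts)
```

The string returned by pdf_to_text("d.pdf") could then be written to out_text.txt in one step. The trade-off is memory: all page images are held in RAM at once, which may matter for long documents or high DPI values.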