Searching in a PDF (1) - Mango Docs


📅 Last modified: 2023-12-03 15:37:25.728000 🧑 Author: Mango

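Searching in a PDF

If you need to search a large number of PDF files for certain keywords, doing it by hand is very time-consuming. A small Python program can search PDF documents for a given keyword instead. This article shows how to extract text from a PDF and search it with Python.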

Dependencies

Python 3
PyPDF2 library

Steps

Step 1: Install the PyPDF2 library

Install the PyPDF2 library with the following command:

$ pip install PyPDF2
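
To confirm that the installation succeeded, a short check along the following lines may help. It is a sketch that relies only on the standard library's importlib.metadata; the helper name installed_version is an assumption introduced here. The version is worth knowing because PyPDF2 3.0 removed the older camelCase names such as PdfFileReader, so the installed release determines which API names are available.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("PyPDF2")
    if found is None:
        print("PyPDF2 is not installed; run: pip install PyPDF2")
    else:
        print("PyPDF2 version:", found)
```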

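Step 2: Write the code

import re

import PyPDF2

# Open the PDF file in binary mode; the with block closes it afterwards
with open('file.pdf', 'rb') as pdf_file:
    # Create a PdfReader object (PdfFileReader was removed in PyPDF2 3.0)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Search keyword (interpreted as a regular expression)
    search_text = "search keyword"
    # Compile the keyword into a pattern object
    pattern = re.compile(search_text)

    # Iterate over every page of the PDF file, counting pages from 1
    for page_num, page_obj in enumerate(pdf_reader.pages, start=1):
        # Extract the text of the page
        page_text = page_obj.extract_text() or ""
        # Look for matches in the text
        matches = pattern.findall(page_text)

        # If anything was found, print the matches
        if matches:
            print("Page {} - {} matches: ".format(page_num, len(matches)))
            for match in matches:
                print(match)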
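
Because the script passes the keyword directly to re.compile, characters such as `.`, `(` or `$` are interpreted as regular-expression syntax. When the keyword should be matched literally and regardless of case, a variation like the following may be more suitable. It is a minimal sketch: the function name count_matches_per_page is hypothetical, and it works on a list of already-extracted page strings so that the matching logic can be tried without a PDF file.

```python
import re


def count_matches_per_page(page_texts, keyword, ignore_case=True):
    """Return (page_number, match_count) pairs for pages containing keyword.

    page_texts is any iterable of strings, one per page; page numbers
    start at 1. The keyword is escaped, so it is matched literally.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(keyword), flags)
    results = []
    for page_number, text in enumerate(page_texts, start=1):
        # Treat a missing page text (None) as an empty string.
        count = len(pattern.findall(text or ""))
        if count:
            results.append((page_number, count))
    return results


# Sample strings standing in for text extracted from three pages.
pages = ["Price: $5.00 (approx.)", "nothing here", "PRICE list; price: $5.00"]
print(count_matches_per_page(pages, "price"))
print(count_matches_per_page(pages, "$5.00"))
```

In the PDF script, the list of page strings could be built from the text extracted from each page before calling the function.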

Step 3: Run the program

Save the code as search_pdf.py and run it with the following command:

python search_pdf.py
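
Since the stated goal is to search a large number of PDF files, the single-file script can be wrapped in a loop over a folder. The sketch below assumes PyPDF2 2.x or later, which provides PdfReader; the function names find_pdf_files and search_folder are placeholders introduced for illustration, and the keyword is matched literally.

```python
import re
from pathlib import Path


def find_pdf_files(folder):
    """Return all .pdf files under folder (recursively), sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def search_folder(folder, keyword):
    """Yield (path, page_number, match_count) for every page that matches."""
    # Imported here so that find_pdf_files works even without PyPDF2.
    from PyPDF2 import PdfReader

    pattern = re.compile(re.escape(keyword))
    for path in find_pdf_files(folder):
        try:
            reader = PdfReader(str(path))
            for page_number, page in enumerate(reader.pages, start=1):
                count = len(pattern.findall(page.extract_text() or ""))
                if count:
                    yield path, page_number, count
        except Exception as exc:
            # Unreadable or encrypted file: report it and continue.
            print("Skipping {}: {}".format(path, exc))
```

For example, `for path, page, count in search_folder("docs", "search keyword"): print(path, page, count)` would list every matching page under a folder named docs (a hypothetical folder name).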

Complete code

Python code for searching a PDF for a specific keyword:

import re

import PyPDF2

# PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
with open('file.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    search_text = "search keyword"
    pattern = re.compile(search_text)

    for page_num, page_obj in enumerate(pdf_reader.pages, start=1):
        page_text = page_obj.extract_text() or ""
        matches = pattern.findall(page_text)

        if matches:
            print("Page {} - {} matches: ".format(page_num, len(matches)))
            for match in matches:
                print(match)

Note: replace file.pdf and search keyword with the actual file name and search keyword. This code snippet is for demonstration purposes only.
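
Rather than editing the script each time, the file name and keyword could be taken from the command line. The following is one possible sketch using the standard argparse module; the argument names pdf and keyword and the --ignore-case flag are choices made for this example, not part of the original script. It assumes PyPDF2 2.x or later for PdfReader.

```python
import argparse
import re


def build_parser():
    """Create the command-line parser for the search script."""
    parser = argparse.ArgumentParser(description="Search a PDF for a keyword.")
    parser.add_argument("pdf", help="path to the PDF file")
    parser.add_argument("keyword", help="regular expression to search for")
    parser.add_argument("-i", "--ignore-case", action="store_true",
                        help="match without regard to case")
    return parser


def main(argv=None):
    """Parse arguments, search every page, and print the matches."""
    args = build_parser().parse_args(argv)
    flags = re.IGNORECASE if args.ignore_case else 0
    pattern = re.compile(args.keyword, flags)

    # Imported here so the parser can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(args.pdf, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        for page_num, page in enumerate(reader.pages, start=1):
            matches = pattern.findall(page.extract_text() or "")
            if matches:
                print("Page {} - {} matches:".format(page_num, len(matches)))
                for match in matches:
                    print(match)
    return 0
```

With main() called at the end of search_pdf.py, the script could then be run as `python search_pdf.py file.pdf "search keyword" -i`.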