How to Extract PDF Tables in Python?

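This article covers ways to extract tables from a PDF in Python. First, what is a PDF file?

PDF (Portable Document Format) is a file format that captures all the elements of a printed document as an electronic image that you can view, navigate, print, or forward to someone else. PDF files can be created with tools such as Adobe Acrobat.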

Example:

Suppose a PDF file contains the following table:

User_ID	Name	Occupation
1	David	Product Manager
2	Leo	IT Administrator
3	John	Lawyer

We want to read this table into a Python program. There are several ways to do this. Let's go through them one by one.

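Method 1: Using tabula-py

tabula-py is a simple Python wrapper for tabula-java that can read tables from a PDF. You can install tabula-py, along with tabulate for printing the result, using the following commands.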

pip install tabula-py
pip install tabulate
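
Because tabula-py wraps tabula-java, it needs a Java runtime on the machine in addition to the pip packages. Below is a minimal sketch for checking this up front, assuming the `java` executable should be discoverable on the PATH. The helper name `java_available` is made up for this illustration and is not part of tabula-py.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


# Warn early instead of failing later inside read_pdf().
if not java_available():
    print("Java not found: tabula-py needs a Java runtime to run tabula-java.")
```

If the check fails, installing a Java runtime and reopening the terminal is usually enough for the later examples to work.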

The functions used in the example are:

read_pdf(): reads the tables from the PDF file at the given path

tabulate(): arranges the data in a table format for printing


The example reads a PDF file named abc.pdf.

Python3

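from tabula import read_pdf
from tabulate import tabulate

# read every table in the PDF; read_pdf returns a list of DataFrames,
# one per detected table
tables = read_pdf("abc.pdf", pages="all")  # path to the PDF file

# print each extracted table with its column names as headers
for table in tables:
    print(tabulate(table, headers="keys"))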


Method 2: Using Camelot

Camelot is a Python library that helps extract tables from PDF files. You can install the camelot-py library using the following command.

pip install camelot-py

The functions and attributes used in the example are:

read_pdf(): reads the tables from the PDF file at the given path

tables[index].df: the table at the given index, as a pandas DataFrame


The example reads a PDF file named test.pdf.

Python3

import camelot

# extract the tables from the PDF; by default Camelot only parses page 1
tables = camelot.read_pdf("test.pdf")  # path to the PDF file

# print the first table as a pandas DataFrame
print(tables[0].df)
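
Camelot typically does not promote the first row to column names: the DataFrame in `.df` has integer column labels, and the header text (User_ID, Name, Occupation) sits in the first row. The sketch below shows one way to turn such rows into records keyed by the header. It works on a plain list of lists, which is an assumption made so the example runs without a PDF; with a real Camelot table the rows could come from calling `.df.values.tolist()` on it. The helper name `rows_to_records` is hypothetical.

```python
def rows_to_records(rows):
    """Treat the first row as the header and return one dict per remaining row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Sample rows shaped like the example table from this article.
rows = [
    ["User_ID", "Name", "Occupation"],
    ["1", "David", "Product Manager"],
    ["2", "Leo", "IT Administrator"],
    ["3", "John", "Lawyer"],
]

records = rows_to_records(rows)
for record in records:
    print(record["Name"], "-", record["Occupation"])
```

Note that every value stays a string here, including User_ID, so numeric columns still need an explicit conversion if they are used in calculations.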
