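How to extract data from common file formats in Python?
Practice datasets are often supplied mainly as .csv (comma-separated values) files. They are a good starting point for applying data science techniques and algorithms. Sooner or later, though, many of us will join a data science company or work on a real-world data science project. Unfortunately, in real projects the data rarely arrives as a tidy .csv file. There, we have to extract it from different sources such as images, PDF files, and doc files. This article offers a starting point for handling those situations.
Below, we will see how to extract relevant information from several such sources.
1. Multi-sheet Excel files
Note that if an Excel file has only one sheet, it can be read much like a CSV file, using pd.read_excel('File.xlsx') in place of pd.read_csv. That is not enough for a workbook with multiple sheets, such as the one used here, which has 3 sheets (Sheet1, Sheet2, Sheet3). In that case, the call returns only the first sheet.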
Example: we will see how to read this Excel file (Sample1.xlsx).
Python3
# import Pandas library
import pandas as pd
# Read our file. Here sheet_name=1
# means we are reading the 2nd sheet or Sheet2
df = pd.read_excel('Sample1.xlsx', sheet_name = 1)
df.head()
Now let's read selected columns from the same sheet:
Python3
# Read only columns A, B and C out of
# the four columns A, B, C, D in Sheet2
df = pd.read_excel('Sample1.xlsx',
                   sheet_name=1, usecols='A:C')
df.head()
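Now let's read all the sheets together:
Sheet1 contains columns A, B, C; Sheet2 contains A, B, C, D; and Sheet3 contains B, D. Below is a simple example of how to read all 3 sheets together and combine them on their common columns.
Python3
# sheet_name=None reads every sheet and returns
# a dict that maps sheet names to DataFrames
df = pd.read_excel('Sample1.xlsx', sheet_name=None)
df2 = pd.DataFrame()
for i in df.keys():
    df2 = pd.concat([df2, df[i]],
                    axis=0)
# display() is available in Jupyter/IPython; use print(df2) elsewhere
display(df2)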
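For readers without the workbook at hand, the effect of stacking sheets with different columns can be reproduced with in-memory frames. This is a minimal sketch: the column layout follows the three sheets described above, but the cell values and the names `sheets`, `merged` and `common` are made up for illustration.

```python
import pandas as pd

# Stand-ins for the dict that pd.read_excel(..., sheet_name=None) returns.
# Column names follow the article; the cell values are invented.
sheets = {
    "Sheet1": pd.DataFrame({"A": [1], "B": [2], "C": [3]}),
    "Sheet2": pd.DataFrame({"A": [4], "B": [5], "C": [6], "D": [7]}),
    "Sheet3": pd.DataFrame({"B": [8], "D": [9]}),
}

# Stack all sheets row-wise; ignore_index renumbers the rows from 0.
merged = pd.concat(list(sheets.values()), axis=0, ignore_index=True)

# A column that a sheet lacks is filled with NaN for that sheet's rows.
print(merged)

# join="inner" keeps only the columns that every sheet shares.
common = pd.concat(list(sheets.values()), axis=0, join="inner",
                   ignore_index=True)
print(list(common.columns))
```

The default outer join keeps every column and marks missing cells as NaN, while `join="inner"` is the stricter choice when only the columns shared by all sheets are wanted.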
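2. Extracting text from images
Now we will discuss how to extract text from images.
To give our Python program character recognition capabilities, we will use the pytesseract OCR library. It can be installed into our Python environment by running the following command in the operating system's command interpreter: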
pip install pytesseract
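pytesseract is a wrapper around the Tesseract OCR engine, so on Windows it also needs the tesseract.exe binary to be installed. During installation of that executable, we are prompted to choose a path for it. Remember this path, because it is used later in the code. For many installations the path is C:\Program Files (x86)\Tesseract-OCR\tesseract.exe; the code in this section uses C:\Program Files\Tesseract-OCR\tesseract.exe instead, so use whichever matches your system.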
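Because the install location varies between machines, it can help to look the executable up instead of hard-coding it. The sketch below uses only the standard library; `find_tesseract` and `WINDOWS_CANDIDATES` are hypothetical names introduced here, and the candidate paths are simply the two locations mentioned in this section.

```python
import os
import shutil


def find_tesseract(candidates=()):
    """Return a path to the tesseract executable, or None if not found."""
    # First try the PATH, which covers most Linux and macOS installs.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Then try explicit locations, e.g. typical Windows install folders.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Hypothetical candidate list; adjust it to your own installation.
WINDOWS_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]

tesseract_path = find_tesseract(WINDOWS_CANDIDATES)
print(tesseract_path)
```

If a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling image_to_string; if the result is None, Tesseract is probably not installed or lives somewhere else.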
Demo image: pic.png
Python3
# We import necessary libraries.
# The PIL Library is used to read the images
from PIL import Image
import pytesseract
# Read the image
image = Image.open(r'pic.png')
# Perform the information extraction from images
# Note below, put the address where tesseract.exe
# file is located in your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
print(pytesseract.image_to_string(image))
Output:
GeeksforGeeks
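3. Extracting text from Doc files
Here we will use the docx module (from the python-docx package) to extract text from a doc file. Note that python-docx reads the .docx format, not the legacy binary .doc format.
Installation: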
pip install python-docx
Demo file: Aniket_Doc.docx
Example 1: first we will extract the title:
Python3
# Importing our library and reading the doc file
import docx
doc = docx.Document('csv/g.docx')
# Printing the title
print(doc.paragraphs[0].text)
Output:
My Name Aniket
Example 2: next we will extract the remaining text in the document (excluding tables).
Python3
# Getting all the text in the doc file
l = [p.text for p in doc.paragraphs]
# There might be many useless empty
# strings present, so remove them
l = [i for i in l if len(i) != 0]
print(l)
Output:
['My Name Aniket', ' Hello I am Aniket', 'I am giving tutorial on how to extract text from MS Doc.', 'Please go through it carefully.']
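It can help to know what python-docx is reading: a .docx file is a ZIP archive whose main text lives in word/document.xml. The following is a minimal standard-library sketch of that idea, not python-docx's own implementation. The helper name `paragraphs_from_docx` is hypothetical, and the in-memory archive is a stand-in that contains only the one part the sketch reads, so it is not a complete Word document. Unlike doc.paragraphs, this walk would also pick up paragraphs that sit inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_docx(file_obj):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    # Every w:p element is a paragraph; its text is split across w:t runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text:
            texts.append(text)
    return texts


# Build a tiny stand-in archive in memory for demonstration.
body = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>My Name Aniket</w:t></w:r></w:p>"
    "<w:p/>"
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>I am Aniket</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", body)
buffer.seek(0)

print(paragraphs_from_docx(buffer))
# → ['My Name Aniket', 'Hello I am Aniket']
```

For real documents python-docx remains the more convenient choice, since it also exposes styles, tables and other parts of the file.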
Example 3: now we will extract the table:
Python3
# Since there is only one table in
# our doc file we are using index 0. For multiple
# tables you can use a suitable for loop
table = doc.tables[0]
# Initializing some empty lists
list1 = []
list2 = []
# Looping through each row of the table
for i in range(len(table.rows)):
    # Looping through each column of a row
    for j in range(len(table.columns)):
        # Extracting the required text
        list1.append(table.rows[i].cells[j].paragraphs[0].text)
    # Store a copy of the finished row, then reset the buffer
    list2.append(list1[:])
    list1.clear()
print(list2)
Output:
[['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]
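The nested list above is easier to work with once the header row is attached to each data row. As a small sketch, assuming the first row is always the header (true for this demo table, but not guaranteed in general), the rows can be turned into dictionaries. The name `rows_to_records` is introduced here for illustration.

```python
# The table rows exactly as printed in the output above.
rows = [['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]


def rows_to_records(rows):
    """Treat the first row as the header and map each later row to a dict."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


records = rows_to_records(rows)
for record in records:
    print(record)
# → {'A': '12', 'B': 'aNIKET', 'C': '@@@'}
# → {'A': '3', 'B': 'SOM', 'C': '+12&'}
```

All values stay strings, as they were in the document; converting columns such as A to numbers is a separate step.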
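4. Extracting data from PDF files
The task is to extract data (images and text) from a PDF in Python. We will extract the text and the images from a PDF file with the PyMuPDF library and save the images. First, we install PyMuPDF together with Pillow.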
pip install PyMuPDF Pillow
Example 1:
Now we will extract data from the PDF version of the same doc file.
Python3
# import module
import fitz
# Reading our pdf file
docu = fitz.open('file.pdf')
# Initializing an empty list where we will put all text
text_list = []
# Looping through all pages of the pdf file
# (recent PyMuPDF releases use snake_case names; older
# releases spelled these pageCount, loadPage and getText)
for i in range(docu.page_count):
    # Loading each page
    pg = docu.load_page(i)
    # Extracting text from each page
    pg_txt = pg.get_text('text')
    # Appending text to the list
    text_list.append(pg_txt)
# Cleaning the text of all pages by removing useless
# empty strings and the unicode character '\u200b'
text_list = [i.replace('\u200b', '') for i in '\n'.join(text_list).split('\n') if len(i.strip()) != 0]
print(text_list)
Output:
['My Name Aniket ', ' Hello I am Aniket ', 'I am giving tutorial on how to extract text from MS Doc. ', 'Please go through it carefully. ', 'A ', 'B ', 'C ', '12 ', 'aNIKET ', '@@@ ', '3 ', 'SOM ', '+12& ']
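The strings in the output above still carry leading and trailing spaces. A slightly stricter cleanup can be written as a plain function and tried out without a PDF. This is a sketch under the assumption that surrounding whitespace is not meaningful; the name `clean_page_text` and the sample strings are made up for illustration.

```python
def clean_page_text(page_texts):
    """Split page texts into lines, dropping zero-width spaces and blanks."""
    lines = []
    for page_text in page_texts:
        for line in page_text.split("\n"):
            # Remove the zero-width space, then surrounding whitespace.
            line = line.replace("\u200b", "").strip()
            if line:
                lines.append(line)
    return lines


# Invented stand-ins for the strings returned by the text extraction step.
sample_pages = ["My Name Aniket \n\u200b\n Hello I am Aniket \n", "A \nB \n\n"]
print(clean_page_text(sample_pages))
# → ['My Name Aniket', 'Hello I am Aniket', 'A', 'B']
```

Keeping the cleanup separate from the extraction loop makes it easy to adjust, for example to preserve indentation when the layout matters.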
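Example 2: extracting images from the PDF.
Python3
# Iterating through the pages
for current_page in range(len(docu)):
    # Getting the images in that page
    # (older PyMuPDF releases called this getPageImageList)
    for image in docu.get_page_images(current_page):
        # Get the XREF of the image. The XREF can be thought of as a
        # reference to the location of the image object in the file
        xref = image[0]
        # Extract the object, i.e. the image
        # in our pdf file at that XREF
        pix = fitz.Pixmap(docu, xref)
        # PNG cannot store CMYK data, so convert such images to RGB first
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # Storing the image as .png (older releases used writePNG)
        pix.save('page %s - %s.png' % (current_page, xref))
The images are saved in the current working directory with names of the form "page <page number> - <xref>.png". In our case, the file is named page 0 - 7.png.
Now let's plot the image to view it.
Python3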
# Import necessary library
import matplotlib.pyplot as plt
# Read and display the image
img=plt.imread('page 0 - 7.png')
plt.imshow(img)