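How to Convert Unstructured Data to Structured Data Using Python?
Prerequisite: What is Unstructured Data?
Machines sometimes generate data in an unstructured form that is hard to interpret. In biometric attendance data, for example, an employee may punch IN or OUT several times by mistake. We cannot analyze such data or spot the errors until it is in tabular form. In this article, we take unstructured biometric data and convert it into useful information in tabular form.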
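Dataset:
Here we use the Daily Punch – In Report. The data is shown below. Punch records are captured for the Main Door and the Second Door. The Main Door is the outer door and the Second Door is the project room door. We need to determine how much time each employee spent in the project room, that is, behind the Second Door. The dataset we use is Bio.xlsx:
This is the biometric data of John Sherrif, with Punch-IN and Punch-OUT records for the Main Door and the Second Door.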
Department: | ||
---|---|---|
Emp Code: COMP123:John Sherrif | ||
Att. Date | Status | Punch Records |
02-Mar-2021 | Present | 08:33:in(Second Door),08:35:(Second Door),08:37:(Main Door),09:04:out(Second Door),09:09:in(Second Door), 09:15:out(Second Door),09:15:(Second Door),09:18:(Second Door),09:52:in(Second Door),09:54:(Second Door), 10:00:out(Main Door),10:17:in(Main Door),10:53:out(Second Door),11:47:in(Second Door),11:47:(Second Door), 11:49:(Second Door),11:50:(Second Door),13:08:out(Second Door),13:09:(Second Door),13:12:(Second Door), 13:14:in(Second Door),13:36:out(Second Door),13:36:(Second Door),14:27:in(Second Door),14:32:out(Main Door), 14:48:in(Second Door),14:48:(Second Door),14:49:(Second Door),14:52:(Main Door),14:56:out(Second Door), 14:57:(Second Door),14:59:(Second Door),15:04:in(Second Door),16:22:out(Second Door),16:34:in(Second Door), 19:58:out(Main Door), |
The data above is hard to analyze. The output we want is:
Emp Code | Punch - IN | Punch - OUT |
---|---|---|
COMP123:John Sherrif | 08:33:in(Second Door) | 09:04:out(Second Door) |
COMP123:John Sherrif | 09:09:in(Second Door) | 09:15:out(Second Door) |
COMP123:John Sherrif | 09:52:in(Second Door) | 10:53:out(Second Door) |
COMP123:John Sherrif | 11:47:in(Second Door) | 13:08:out(Second Door) |
COMP123:John Sherrif | 13:14:in(Second Door) | 13:36:out(Second Door) |
COMP123:John Sherrif | 14:27:in(Second Door) | out |
COMP123:John Sherrif | 14:48:in(Second Door) | 14:56:out(Second Door) |
COMP123:John Sherrif | 15:04:in(Second Door) | 16:22:out(Second Door) |
COMP123:John Sherrif | 16:34:in(Second Door) | out |
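Understanding the data: John Sherrif made his first Punch-IN at 08:33 and his first Punch-OUT at 09:04. He punched in at 14:27 but forgot to punch out. A bare 'out' in the Punch - OUT column means the employee forgot to punch out, and a bare 'in' in the Punch - IN column means the employee forgot to punch in.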
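To make the record format concrete, the sketch below parses one raw entry such as `08:33:in(Second Door)` into its time, direction, and door. The `Punch` tuple, the `parse_punch` helper, and the regular expression are illustrative assumptions inferred from the sample row above; they are not part of the original workflow.

```python
import re
from typing import NamedTuple, Optional


class Punch(NamedTuple):
    time: str        # 'HH:MM'
    direction: str   # 'in', 'out', or '' when the device logged no direction
    door: str        # e.g. 'Second Door'


# Assumed layout, inferred from the sample row: HH:MM:<in|out|empty>(<door>)
PUNCH_RE = re.compile(r"^\s*(\d{2}:\d{2}):(in|out)?\((.+)\)\s*$")


def parse_punch(raw: str) -> Optional[Punch]:
    """Parse one raw punch entry; return None if it does not match the layout."""
    match = PUNCH_RE.match(raw)
    if match is None:
        return None
    time, direction, door = match.groups()
    # The direction group is optional, so it is None for entries like '08:37:(Main Door)'
    return Punch(time, direction or "", door)


print(parse_punch("08:33:in(Second Door)"))
print(parse_punch(" 08:37:(Main Door)"))
```

Entries without a direction, such as `08:35:(Second Door)`, are the repeated punches that the steps below discard.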
Implementation:
- Clean the data and build a table of Status, Punch Records, and Employee Codes.
Python3
import pandas as pd

# Load the raw report (adjust the path to your copy of the file)
df = pd.read_excel('bio.xlsx')

# Replace missing cells with empty strings so that the
# string methods below work on every row
df = df.fillna("")

# Keep the rows whose status column ('Unnamed: 14') reads
# Present or Absent; both words contain 'sent'.
# na=False treats non-text cells as non-matching.
df1 = df[df['Unnamed: 14'].str.contains('sent', na=False)]

# Keep the rows that carry the employee details: column
# 'Unnamed: 3' contains the label 'Employee' on those rows
df_name = df[df['Unnamed: 3'].str.contains('Employee', na=False)]

# Build a new dataframe with Status (column 0), Punch Records
# (column 1) and Employee Codes (column 2). zip pairs the rows by
# position, so this assumes one attendance row per employee.
zippedList = list(
    zip(df1['Unnamed: 14'], df1['Unnamed: 15'], df_name['Unnamed: 7']))
abc = pd.DataFrame(zippedList)
abc.head()
Output:
- Extract only the Second Door records.
Python3
# Split the punch records (column 1) on commas
abc[1] = abc[1].str.split(",")

second_door = []
for records in abc[1]:
    s_d = []
    # Keep only the entries flagged :in(Second Door)
    # or :out(Second Door)
    for record in records:
        if ':in(Second Door)' in record or ':out(Second Door)' in record:
            s_d.append(record)
    second_door.append(s_d)

second_door[0]
Output:
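The filtering step can also be written as a small standalone function, which makes it easy to try on a single raw string without the spreadsheet. This is a minimal sketch: `second_door_records` is a hypothetical helper name, and unlike the loop above it also strips the whitespace that follows some commas in the raw data.

```python
from typing import List


def second_door_records(punch_records: str) -> List[str]:
    """Return the Second Door entries that carry an explicit in/out flag."""
    wanted = (":in(Second Door)", ":out(Second Door)")
    return [
        entry.strip()
        for entry in punch_records.split(",")
        if any(flag in entry for flag in wanted)
    ]


# The first few entries of John Sherrif's record from the sample row
sample = ("08:33:in(Second Door),08:35:(Second Door),08:37:(Main Door),"
          "09:04:out(Second Door),09:09:in(Second Door), 09:15:out(Second Door),")
print(second_door_records(sample))
```

Entries with no direction and entries for the Main Door are dropped, so only the four flagged Second Door punches remain.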
- Each employee's punch records should start with an IN and end with an OUT. Where they do not, add the missing entry to enforce the pattern.
Python3
# The records should start with an 'in' entry. If they do not,
# insert the placeholder 'in', which signifies that the employee
# forgot to punch in. Likewise, the records should end with an
# 'out' entry. If they do not, append the placeholder 'out',
# which signifies that the employee forgot to punch out.
for records in second_door:
    # Nothing to fix for an employee with no Second Door records
    if not records:
        continue
    if ':in(Second Door)' not in records[0]:
        records.insert(0, 'in')
    if ':out(Second Door)' not in records[-1]:
        records.append('out')

second_door[0]
Output:
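As a quick check of this rule on made-up input, the sketch below applies it to a list that begins with an OUT and ends with an IN. `pad_in_out` is a hypothetical helper; it returns a new list instead of modifying its argument, which is a design choice of this example and not of the code above.

```python
from typing import List


def pad_in_out(records: List[str]) -> List[str]:
    """Return a copy that starts with an 'in' entry and ends with an 'out' entry."""
    padded = list(records)
    if not padded:
        return padded  # no records, nothing to pad
    if ":in(" not in padded[0]:
        padded.insert(0, "in")   # placeholder: the punch in was missed
    if ":out(" not in padded[-1]:
        padded.append("out")     # placeholder: the punch out was missed
    return padded


broken = ["09:04:out(Second Door)", "09:09:in(Second Door)"]
print(pad_in_out(broken))
```

A list that already starts with an IN and ends with an OUT comes back unchanged.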
- Build the pattern IN - OUT - IN - ... - OUT. If someone forgot to punch in, insert the placeholder 'in'; if someone forgot to punch out, insert the placeholder 'out'.
Python3
# final_in holds the Punch - IN records for all employees,
# final_out holds the Punch - OUT records
final_in = []
final_out = []

for records in second_door:
    in_gate = []
    out_gate = []
    # The entries should alternate in, out, in, ..., out.
    # Whenever two entries of the same kind are adjacent,
    # add the placeholder 'out' or 'in' to restore the pattern.
    # The position (even or odd) does not matter for this check.
    for i, record in enumerate(records):
        # The following entry, or None at the end of the list
        nxt = records[i + 1] if i + 1 < len(records) else None
        if 'in' in record:
            in_gate.append(record)
            if nxt is not None and 'out' not in nxt:
                out_gate.append('out')
        if 'out' in record:
            out_gate.append(record)
            if nxt is not None and 'in' not in nxt:
                in_gate.append('in')
    final_in.append(in_gate)
    final_out.append(out_gate)

# final_in and final_out are lists of lists, one inner list
# per employee. Flatten them for the final table.
# aa is the flat list of Punch - IN records
aa = [record for gate in final_in for record in gate]
# bb is the flat list of Punch - OUT records
bb = [record for gate in final_out for record in gate]

# Show the IN/OUT pairs of the first employee
for punch_in, punch_out in zip(final_in[0], final_out[0]):
    print(punch_in, ' ', punch_out)
Output:
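The same pairing can be expressed as a single pass that remembers the last unmatched IN. The sketch below is an alternative formulation, not the method used above: `pair_punches` is a hypothetical helper that takes the filtered Second Door entries directly (no start or end padding needed) and returns (IN, OUT) tuples with the same `'in'` and `'out'` placeholders.

```python
from typing import List, Optional, Tuple


def pair_punches(records: List[str]) -> List[Tuple[str, str]]:
    """Pair each IN with the next OUT, filling gaps with placeholders."""
    pairs: List[Tuple[str, str]] = []
    pending_in: Optional[str] = None  # last IN that has no OUT yet
    for entry in records:
        if ":in(" in entry:
            if pending_in is not None:
                # Two INs in a row: the earlier one was never closed
                pairs.append((pending_in, "out"))
            pending_in = entry
        elif ":out(" in entry:
            # An OUT with no open IN means the punch in was missed
            pairs.append((pending_in if pending_in is not None else "in", entry))
            pending_in = None
    if pending_in is not None:
        pairs.append((pending_in, "out"))  # the final IN was never closed
    return pairs


# John Sherrif's flagged Second Door entries from the sample row
john = [
    "08:33:in(Second Door)", "09:04:out(Second Door)",
    "09:09:in(Second Door)", "09:15:out(Second Door)",
    "09:52:in(Second Door)", "10:53:out(Second Door)",
    "11:47:in(Second Door)", "13:08:out(Second Door)",
    "13:14:in(Second Door)", "13:36:out(Second Door)",
    "14:27:in(Second Door)", "14:48:in(Second Door)",
    "14:56:out(Second Door)", "15:04:in(Second Door)",
    "16:22:out(Second Door)", "16:34:in(Second Door)",
]
for punch_in, punch_out in pair_punches(john):
    print(punch_in, " ", punch_out)
```

On this input the function yields nine pairs, with a bare `out` placeholder for the 14:27 and 16:34 punches, matching the desired table shown earlier.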
- Create the final table.
Python3
# Repeat each employee code once per
# Punch - IN / Punch - OUT pair
Name = []
for i in range(len(abc)):
    for j in range(len(final_in[i])):
        Name.append(abc[2][i])

# Zip the employee codes with the Punch - IN
# and Punch - OUT records and name the columns
zippedList2 = list(zip(Name, aa, bb))
abc2 = pd.DataFrame(
    zippedList2, columns=['Emp Code', 'Punch - IN', 'Punch - OUT'])
abc2.to_excel('output.xlsx', index=False)

# Show the table. display() is available in Jupyter/IPython;
# use print(abc2) in a plain script.
display(abc2)
Output:
Thus the raw biometric data has been structured and converted into useful information.
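The stated goal was to find how much time each employee spent in the project room. As a follow-up sketch, the function below sums the minutes across complete IN/OUT pairs. `minutes_inside` is a hypothetical helper; it assumes every real entry starts with an `HH:MM` time, that all punches fall on the same day, and that pairs containing an `'in'` or `'out'` placeholder should be skipped because their duration is unknown.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def minutes_inside(pairs: Iterable[Tuple[str, str]]) -> int:
    """Sum the minutes between IN and OUT over all complete pairs."""
    total = timedelta()
    for punch_in, punch_out in pairs:
        punch_in, punch_out = punch_in.strip(), punch_out.strip()
        if punch_in == "in" or punch_out == "out":
            continue  # a punch is missing, so the duration is unknown
        # The first five characters hold the time, e.g. '08:33'
        start = datetime.strptime(punch_in[:5], "%H:%M")
        end = datetime.strptime(punch_out[:5], "%H:%M")
        total += end - start
    return int(total.total_seconds() // 60)


sample_pairs = [
    ("08:33:in(Second Door)", "09:04:out(Second Door)"),  # 31 minutes
    ("09:09:in(Second Door)", "09:15:out(Second Door)"),  # 6 minutes
    ("14:27:in(Second Door)", "out"),                     # skipped
]
print(minutes_inside(sample_pairs))  # 37
```

Skipping incomplete pairs gives a lower bound on the time spent; how to treat missed punches is a policy decision that the data alone cannot settle.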