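Python | Pandas dataframe.drop_duplicates()
Python is a great language for data analysis, primarily because of its rich ecosystem of data-centric packages. Pandas is one of those packages, and it makes importing and analyzing data much easier.
An important part of data analysis is finding duplicate values and removing them. The Pandas drop_duplicates() method helps remove duplicates from a data frame.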
Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
subset: A column label or a list of column labels. The default is None, which means all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
keep: Controls which of the duplicated rows, if any, is kept. It accepts three values and the default is 'first'.
- If 'first', the first occurrence is kept and all later occurrences are treated as duplicates.
- If 'last', the last occurrence is kept and all earlier occurrences are treated as duplicates.
- If False, every occurrence is treated as a duplicate, so none of them is kept.
inplace: Boolean, default False. If True, the duplicate rows are removed from the calling DataFrame itself and the method returns None.
Return type: DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
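To make the three keep options concrete, here is a minimal sketch on a small in-memory data frame. The sample rows and the variable names (df, first, last, none_kept) are made up for illustration and are not taken from employees.csv.

```python
import pandas as pd

# Hypothetical sample data, not taken from employees.csv
df = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Cid"],
    "Team": ["Sales", "Legal", "Finance", "Legal"],
})

# Only the "First Name" column is compared because of subset
first = df.drop_duplicates(subset="First Name", keep="first")
last = df.drop_duplicates(subset="First Name", keep="last")
none_kept = df.drop_duplicates(subset="First Name", keep=False)

# The surviving rows keep their original index labels
print(first.index.tolist())      # [0, 1, 3]
print(last.index.tolist())       # [1, 2, 3]
print(none_kept.index.tolist())  # [1, 3]
```

Because inplace is left at its default of False, each call returns a new data frame and df itself still has all four rows.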
The examples below read their data from a CSV file named employees.csv.
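Example #1: Removing rows with the same First Name
In the following example, every row whose First Name occurs more than once is removed. Because inplace=True is passed, the data frame is modified in place rather than a new one being returned.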
Python
# import the pandas package
import pandas as pd

# build the data frame from the CSV file
data = pd.read_csv("employees.csv")

# sort by first name
data.sort_values("First Name", inplace=True)

# drop ALL rows whose first name appears more than once
data.drop_duplicates(subset="First Name", keep=False, inplace=True)

# display the data
print(data)
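Output:
In the output, every row whose First Name occurs more than once has been removed from the data frame, so each remaining name appears exactly once.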
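The same behaviour can be checked without the CSV file. The sketch below uses a few made-up rows as a stand-in for employees.csv, inspects the duplicates with DataFrame.duplicated() before dropping them, and contrasts the default call, which returns a new data frame, with inplace=True.

```python
import pandas as pd

# Hypothetical rows standing in for employees.csv
data = pd.DataFrame({
    "First Name": ["Maria", "Douglas", "Maria", "Thomas", "Douglas"],
    "Salary": [130590, 97308, 61933, 138705, 45906],
})

# Boolean mask: True for every row whose first name occurs more than once
mask = data.duplicated(subset="First Name", keep=False)
print(mask.tolist())  # [True, True, True, False, True]

# The default (inplace=False) returns a new data frame; `data` is untouched
unique_names = data.drop_duplicates(subset="First Name", keep=False)
print(len(data), len(unique_names))  # 5 1

# inplace=True modifies `data` itself and returns None
result = data.drop_duplicates(subset="First Name", keep=False, inplace=True)
print(result, len(data))  # None 1
```

Only the row for Thomas survives, since it is the only first name that is not repeated.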
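Example #2: Removing rows in which all values are duplicated
In this example, rows that are identical in every column are removed. Since the CSV file contains no such rows, one row is first copied and inserted into the data frame.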
Python
# import the pandas package
import pandas as pd

# build the data frame from the CSV file
data = pd.read_csv("employees.csv")

# length before adding the row
length1 = len(data)

# manually insert a copy of row 440 under the new index label 1001
data.loc[1001] = [data["First Name"][440],
                  data["Gender"][440],
                  data["Start Date"][440],
                  data["Last Login Time"][440],
                  data["Salary"][440],
                  data["Bonus %"][440],
                  data["Senior Management"][440],
                  data["Team"][440]]

# length after adding the row
length2 = len(data)

# sort by first name
data.sort_values("First Name", inplace=True)

# drop every row that is duplicated across all columns
data.drop_duplicates(keep=False, inplace=True)

# length after removing duplicates
length3 = len(data)

# print all three data frame lengths
print(length1, length2, length3)
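Output:
As the output shows, the length after removing duplicates is 999. Because the keep parameter is set to False, all duplicated rows are removed: both row 440 and its inserted copy are dropped.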
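The length arithmetic can be reproduced on a tiny data frame. This is a sketch under assumed data: the three rows and the index label 3 are hypothetical, chosen so that no two rows are fully identical until one is copied.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv with no fully identical rows
data = pd.DataFrame({
    "First Name": ["Ann", "Ann", "Bob"],
    "Team": ["Sales", "Legal", "Legal"],
})
length1 = len(data)

# Append a copy of row 1 under a new index label (3 is unused here)
data.loc[3] = data.loc[1].tolist()
length2 = len(data)

# With no subset, rows must match in every column to count as duplicates
deduped = data.drop_duplicates(keep=False)
length3 = len(deduped)

print(length1, length2, length3)  # 3 4 2
print(deduped.index.tolist())     # [0, 2]

# For comparison, the default keep='first' retains one copy of the pair
kept_one = data.drop_duplicates()
print(len(kept_one))              # 3
```

Rows 0 and 1 share a first name but differ in Team, so they are not duplicates when all columns are compared. Only row 1 and its copy are dropped with keep=False.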