Python | Pandas DataFrame.duplicated()
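Python is a great language for data analysis, primarily because of its rich ecosystem of data-centric packages. Pandas is one of those packages, and it makes importing and analyzing data much easier.
An important part of data analysis is finding duplicate values and removing them. The Pandas duplicated() method only helps with finding them; it does not remove anything itself. It returns a boolean Series that is True for each row flagged as a duplicate and False for each row treated as unique.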
Syntax:
DataFrame.duplicated(subset=None, keep='first')
Parameters:
subset: Takes a column label or a list of column labels. The default is None, in which case all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
keep: Controls which occurrences are marked as duplicates. It accepts three values, and the default is 'first'.
–> If 'first', the first occurrence is treated as unique and all later occurrences of the same value are marked as duplicates.
–> If 'last', the last occurrence is treated as unique and all earlier occurrences of the same value are marked as duplicates.
–> If False, every occurrence of a repeated value is marked as a duplicate.
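The effect of subset and keep is easier to see on a tiny table. The following is a minimal sketch; the DataFrame and its values are made up for illustration and are not taken from the CSV file used in the examples.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only.
df = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Ann", "Cara"],
    "Team": ["Sales", "Legal", "Sales", "Legal", "Sales"],
})

# subset=None (default): whole rows are compared.
# Only row 2 repeats an earlier row exactly (Ann, Sales).
all_cols = df.duplicated()

# subset="First Name": only that column is compared.
# keep="first" (default) leaves the first "Ann" unflagged.
by_name_first = df.duplicated(subset="First Name")

# keep="last" leaves the last "Ann" unflagged instead.
by_name_last = df.duplicated(subset="First Name", keep="last")

# keep=False flags every "Ann", since the name occurs more than once.
by_name_all = df.duplicated(subset="First Name", keep=False)

print(pd.DataFrame({
    "all_cols": all_cols,
    "first": by_name_first,
    "last": by_name_last,
    "False": by_name_all,
}))
```

In every case True means "flagged as a duplicate"; the keep argument only decides which occurrence, if any, escapes the flag.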
To download the CSV file used in the examples, click here.
Example #1: Returning a boolean Series
In the following example, a boolean Series is returned based on duplicate values in the First Name column.
Python
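# import the pandas package
import pandas as pd
# build a DataFrame from the CSV file
data = pd.read_csv("employees.csv")
# sort by first name so that repeated names sit next to each other
data = data.sort_values("First Name")
# boolean Series: True for every repeat of a name after its first occurrence
bool_series = data["First Name"].duplicated()
# display the first rows of the sorted data
print(data.head())
# display only the rows flagged as duplicates
print(data[bool_series])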
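Output:
As the output shows, because the keep parameter defaults to 'first', whenever a name occurs more than once its first occurrence is treated as unique and the rest are marked as duplicates.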
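Note that data["First Name"].duplicated() is called on a single column, so it is the Series version of the method; passing subset to the DataFrame version should give the same mask. The sketch below checks this on a small inline table, since employees.csv may not be at hand. The names and salaries are hypothetical stand-ins, not rows from the real file.

```python
import pandas as pd

# Hypothetical stand-in for a few sorted rows of the employee data.
data = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Alan", "Alan", "Alan"],
    "Salary": [61602, 58755, 110194, 40341, 111786, 95198],
})

# Series.duplicated on one column, as in the example above.
bool_series = data["First Name"].duplicated()

# DataFrame.duplicated restricted to the same column.
same_mask = data.duplicated(subset="First Name")

# Summing a boolean Series counts the True values, i.e. the flagged rows.
n_duplicates = int(bool_series.sum())

# Boolean indexing keeps only the flagged rows.
duplicate_rows = data[bool_series]

print(bool_series.equals(same_mask))
print(n_duplicates)
print(duplicate_rows)
```

With the default keep='first', the first "Aaron" and the first "Alan" are not flagged, so the filtered frame contains only the later repeats.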
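Example #2: Removing duplicates
In this example, the keep parameter is set to False, so every row whose First Name occurs more than once is flagged. Inverting the result keeps only the rows with unique values and removes all duplicated ones from the data.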
Python
# import the pandas package
import pandas as pd
# build a DataFrame from the CSV file
data = pd.read_csv("employees.csv")
# sort by first name
data = data.sort_values("First Name")
# boolean Series: True for every row whose first name occurs more than once
bool_series = data["First Name"].duplicated(keep=False)
# display the boolean Series
print(bool_series)
# invert the mask to keep only rows whose first name occurs exactly once
data = data[~bool_series]
# display a summary of the remaining data, then the rows themselves
data.info()
print(data)
Output:
Because the duplicated() method returns True for duplicates, the NOT of the Series is taken to see the unique values in the DataFrame.
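The difference between inverting a keep=False mask and inverting the default mask can be sketched on a small inline table. The data below is hypothetical. The comparison with drop_duplicates() is included only as a cross-check and is not part of the original example.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only.
data = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Alan", "Alan", "Alan"],
    "Salary": [61602, 58755, 110194, 40341, 111786, 95198],
})

# keep=False flags every row whose name is repeated;
# ~ inverts the mask, leaving names that occur exactly once.
unique_only = data[~data["First Name"].duplicated(keep=False)]

# Default keep='first' flags only the repeats;
# inverting it leaves one row per distinct name.
first_of_each = data[~data["First Name"].duplicated()]

# drop_duplicates with the same arguments should select the same rows.
via_drop = data.drop_duplicates(subset="First Name", keep=False)

print(unique_only)
print(first_of_each)
print(unique_only.equals(via_drop))
```

Which variant is appropriate depends on the goal: keep=False discards every name that is ever repeated, while the default keeps one representative row per name.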