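Remove duplicates from a dataframe in PySpark
In this article, we will remove duplicate data from a dataframe using PySpark in Python.
Before starting, we will create a dataframe for demonstration: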
Python3
# import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# create a SparkSession and give the app a name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data; the last two rows repeat rows 1 and 4
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# create a dataframe from the list of rows
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
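Method 1: Using the distinct() method
distinct() returns a new dataframe that contains only the unique rows. A row counts as a duplicate only when it matches another row in every column, and the original dataframe is left unchanged.
Syntax: dataframe.distinct()
where dataframe is the input dataframe, here the one created from the nested lists above
Example 1: Python program to remove duplicate rows using the distinct() function
Python3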
print('distinct data after dropping duplicate rows')
# display distinct data
dataframe.distinct().show()
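Example 2: Python program to select distinct data in only two columns.
We can combine the select() function with the distinct() function to get the distinct values of specific columns. The result contains only the selected columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3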
# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()
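Method 2: Using the dropDuplicates() method
dropDuplicates() also returns a new dataframe with the duplicate rows removed. Called without arguments, it compares rows across all columns.
Syntax: dataframe.dropDuplicates()
where dataframe is the input dataframe, here the one created from the nested lists above
Example 1: Python program to remove duplicate data from the employee table.
Python3
# remove duplicate data
# using the dropDuplicates() function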
dataframe.dropDuplicates().show()
Example 2: Python program to remove duplicate values in specific columns
Python3
# remove duplicate data
# using the dropDuplicates() function
# on two selected columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()