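Removing duplicates from a DataFrame in PySpark

In this article, we remove duplicate rows from a DataFrame using PySpark in Python.

Before starting, we create a DataFrame for demonstration: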

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data 
data =[["1","sravan","company 1"],
       ["2","ojaswi","company 1"],
       ["3","rohith","company 2"],
       ["4","sridevi","company 1"],
       ["1","sravan","company 1"],
       ["4","sridevi","company 1"]]
  
# specify column names
columns = ['Employee ID','Employee NAME','Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
  
print('Actual data in dataframe')
dataframe.show()

Method 1: Using the distinct() method

It returns a new DataFrame with the duplicate rows removed.

Syntax: dataframe.distinct()

where dataframe is the DataFrame created from the nested lists using PySpark.

Example 1: Python program to remove duplicate data using the distinct() function

Python3

print('distinct data after dropping duplicate rows')
  
# display distinct data
dataframe.distinct().show()

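Output (row order may vary): the four unique rows, one each for employee IDs 1, 2, 3 and 4; the repeated rows for sravan and sridevi appear only once.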

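Example 2: Python program to select distinct data in only two columns.

We can use the select() function together with distinct() to get the distinct values from specific columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()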

Python3

# display distinct data in
# Employee ID and Employee NAME 
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()

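Output (row order may vary): four rows containing only the Employee ID and Employee NAME columns, one for each of the employee IDs 1 to 4.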

Method 2: Using the dropDuplicates() method

Syntax: dataframe.dropDuplicates()

where dataframe is the DataFrame created from the nested lists using PySpark.

Example 1: Python program to remove duplicate data from the employee table.

Python3

# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()

Output (row order may vary): the four unique rows, one each for employee IDs 1, 2, 3 and 4.

Example 2: Python program to remove duplicate values in specific columns

Python3

# remove duplicate data
# using dropDuplicates()function 
# in two columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

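Output (row order may vary): four rows with only the Employee ID and Employee NAME columns, with the repeated entries for sravan and sridevi removed.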