📜  Removing Duplicate Rows from a PySpark DataFrame

📅  Last modified: 2022-05-13 01:55:14.884000             🧑  Author: Mango


In this article, we will use PySpark in Python to remove duplicate rows from a DataFrame with the distinct() and dropDuplicates() functions.

Let's start by creating a sample DataFrame.

Python3
# importing module
import pyspark
  
# importing sparksession from 
# pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving 
# an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data
data = [["1", "sravan", "company 1"], 
        ["2", "ojaswi", "company 1"], 
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"], 
        ["1", "sravan", "company 1"], 
        ["4", "sridevi", "company 1"]]
  
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']
  
# creating a dataframe from the 
# lists of data
dataframe = spark.createDataFrame(data, columns)
  
print('Actual data in dataframe')
dataframe.show()


Output:



Method 1: distinct()

distinct() returns only the unique rows: it removes duplicate rows from the DataFrame.

Python3

print('distinct data after dropping duplicate rows')
  
# display distinct data
dataframe.distinct().show()

Output:

We can combine select() with distinct() to get the distinct values of specific columns.

Python3

# display distinct data in Employee
# ID and Employee NAME
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()

Output:

Method 2: dropDuplicates()

Python3

# remove duplicate data using 
# dropDuplicates() function
dataframe.dropDuplicates().show()

Output:

Python program to remove duplicate values in specific columns:

Python3

# remove duplicate data using 
# dropDuplicates() function in 
# two columns
dataframe.select(['Employee ID', 'Employee NAME']
                ).dropDuplicates().show()

Output: