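Removing Duplicate Rows from a PySpark DataFrame
In this article, we will remove duplicate rows from a PySpark DataFrame using the distinct() and dropDuplicates() functions.
Let's create a sample DataFrame that contains some duplicate rows.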
Python3
# importing module
import pyspark
# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving
# an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']
# creating a dataframe from the
# lists of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
dataframe.show()
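Method 1: distinct()
distinct() returns a new DataFrame that contains only the unique rows. Rows that are identical in every column are collapsed into a single row.
Syntax: dataframe.distinct()
where dataframe is the DataFrame created from the nested lists above.
Python3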
print('distinct data after dropping duplicate rows')
# display distinct data
dataframe.distinct().show()
We can combine select() with distinct() to get the distinct values of specific columns. The result contains only the selected columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
# display distinct data in Employee
# ID and Employee NAME
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
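Method 2: dropDuplicates()
dropDuplicates() also removes duplicate rows. Called with no arguments, it compares all columns, so it gives the same result as distinct().
Syntax: dataframe.dropDuplicates()
where dataframe is the DataFrame created from the nested lists above.
Python3
# remove duplicate data using
# the dropDuplicates() function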
dataframe.dropDuplicates().show()
Python program to remove duplicate values in specific columns
Python3
# remove duplicate data using
# dropDuplicates() function in
# two columns
dataframe.select(['Employee ID', 'Employee NAME']
).dropDuplicates().show()