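How to avoid duplicate columns after a join in PySpark?
In this article, we discuss how to avoid duplicate columns in a DataFrame after a join in PySpark, using Python.
Creating the first DataFrame for demonstration: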
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
Creating the second DataFrame for demonstration:
Python3
# list of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]
# specify column names
columns = ['ID', 'salary', 'department']
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)
dataframe1.show()
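Output:
Method 1: Using the drop() function
We can join the DataFrames with an ordinary join such as an inner join. Because the join condition is an expression, the result contains the join column from both DataFrames, and the drop() method can then remove one of the two copies.
Syntax: dataframe.join(dataframe1, dataframe.column_name == dataframe1.column_name, "inner").drop(dataframe.column_name)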
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- inner specifies inner join
- drop(dataframe.column_name) removes the first dataframe's copy of the common column, so only dataframe1's copy remains in the result
Example: Join the two DataFrames on ID and drop the duplicate ID column that comes from the first DataFrame
Python3
# inner join on two dataframes
# and remove duplicate column
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "inner").drop(dataframe.ID).show()
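Output:
Method 2: Using join()
Here we pass the name of the common column to join() as a list. Spark then keeps a single copy of that column in the result, so no separate drop step is needed.
Syntax: dataframe.join(dataframe1, ['column_name']).show()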
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- column_name is the common column that exists in both dataframes
Example: Join on ID so that the result has no duplicate ID column
Python3
# join two dataframes on the shared column name;
# the result contains a single ID column
dataframe.join(dataframe1, ['ID']).show()
Output: