How to avoid duplicate columns after a join in PySpark?

In this article, we will discuss how to avoid duplicate columns in a DataFrame after a join in PySpark using Python.

Creating the first dataframe for demonstration:

Python3

# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]
  
# specify column names
columns = ['ID', 'NAME', 'Company']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
dataframe.show()

Creating the second dataframe for demonstration:

Python3

# list of employee data
data1 = [["1", "45000", "IT"],
         ["2", "145000", "Manager"],
         ["6", "45000", "HR"],
         ["5", "34000", "Sales"]]
  
# specify column names
columns = ['ID', 'salary', 'department']
  
# creating a dataframe from the lists of data
dataframe1 = spark.createDataFrame(data1, columns)
  
dataframe1.show()

Method 1: Using the drop() function

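We can join the dataframes with a join such as an inner join. Because the join condition compares a column from each dataframe, the result keeps both copies of that column; calling drop() on one of them afterwards removes the duplicate.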

Syntax: dataframe.join(dataframe1, dataframe.column_name == dataframe1.column_name, "inner").drop(dataframe.column_name)

where,

dataframe is the first dataframe
dataframe1 is the second dataframe
inner specifies inner join
drop() removes one copy of the common column, here the copy that belongs to the first dataframe

Example: join the two dataframes on ID and drop the duplicate ID column that comes from the first dataframe

Python3

# inner join on two dataframes
# and remove duplicate column
dataframe.join(dataframe1,
               dataframe.ID == dataframe1.ID,
               "inner").drop(dataframe.ID).show()

Method 2: Using join()

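Here we pass the name of the common column to join() as a list instead of a comparison expression. The result then contains that column only once, so no separate drop() is needed.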

Syntax: dataframe.join(dataframe1, ['column_name']).show()

where,

dataframe is the first dataframe
dataframe1 is the second dataframe
column_name is the common column that exists in both dataframes

Example: join on ID without producing a duplicate column

Python3

# join the two dataframes on the shared ID column by name;
# the result keeps a single ID column
dataframe.join(dataframe1, ['ID']).show()
