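Removing duplicate columns after a DataFrame join in PySpark

In this article, we discuss how to remove duplicate columns after joining two DataFrames in PySpark.

Create the first dataframe for demonstration: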

Python3

# Import the necessary library
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()

# Create the data for the dataframe
data = [('Ram', 1, 'M'),
        ('Mike', 2, 'M'),
        ('Rohini', 3, 'M'),
        ('Maria', 4, 'F'),
        ('Jenis', 5, 'F')]

# Column names for the dataframe
columns = ["Name", "ID", "Gender"]

# Create the Spark dataframe
df1 = spark.createDataFrame(data=data, schema=columns)

# Print the dataframe
df1.show()

Output:

+------+---+------+
|  Name| ID|Gender|
+------+---+------+
|   Ram|  1|     M|
|  Mike|  2|     M|
|Rohini|  3|     M|
| Maria|  4|     F|
| Jenis|  5|     F|
+------+---+------+

Create the second dataframe for demonstration:

Python3

# Create the data for the dataframe
data2 = [(1, 3000),
         (2, 4000),
         (3, 4000),
         (4, 4000),
         (5, 1200)]

# Column names for the dataframe
columns = ["ID", "salary"]

# Create the Spark dataframe
df2 = spark.createDataFrame(data=data2, schema=columns)

# Print the dataframe
df2.show()

Output:

+---+------+
| ID|salary|
+---+------+
|  1|  3000|
|  2|  4000|
|  3|  4000|
|  4|  4000|
|  5|  1200|
+---+------+

Using join()

This joins two dataframes.

Syntax: dataframe.join(dataframe1, on).show()

where,

dataframe is the first dataframe
dataframe1 is the second dataframe
on is the join condition, for example dataframe.ID == dataframe1.ID

Let's look at the joined dataframe:

Python3

df = df1.join(df2, df1.ID==df2.ID)
df.show()

Output:

+------+---+------+---+------+
|  Name| ID|Gender| ID|salary|
+------+---+------+---+------+
| Jenis|  5|     F|  5|  1200|
|   Ram|  1|     M|  1|  3000|
|Rohini|  3|     M|  3|  4000|
|  Mike|  2|     M|  2|  4000|
| Maria|  4|     F|  4|  4000|
+------+---+------+---+------+

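Here we see that the ID and salary columns of the second dataframe have been added to the existing dataframe, so the result now contains two columns named ID.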
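
One way to spot this situation programmatically is to count the names in `df.columns`, which is a plain Python list. The sketch below is illustrative only: the helper name `duplicate_columns` is hypothetical, and the list of names is hard-coded from the joined output above so that the snippet runs without a Spark session.

```python
from collections import Counter


def duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name, count in counts.items() if count > 1]


# Column names of the joined dataframe, copied from the output above.
# With a live dataframe, df.columns would supply this list.
joined_columns = ["Name", "ID", "Gender", "ID", "salary"]
print(duplicate_columns(joined_columns))
```

The comparison here is case-sensitive, while Spark resolves column names case-insensitively by default, so lower-casing the names first may be appropriate in some setups.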

Now, let's check the columns:

Here we select the Gender column, which is unique, so it works fine.

Python3

df.select('Gender').show()

Output:

+------+
|Gender|
+------+
|     F|
|     M|
|     M|
|     M|
|     F|
+------+

Now let's check the duplicated column:

Here the select raises an error, because the name ID matches two columns.

Python3

df.select('ID').show()

Output:

AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.

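Removing duplicate columns after a join in PySpark

To remove the duplicate column, specify the shared column name in the join function instead of an equality expression. When the column is passed by name, the join keeps a single copy of it in the result.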

Syntax: dataframe.join(dataframe1, ['column_name']).show()

where,

dataframe is the first dataframe
dataframe1 is the second dataframe
column_name is the common column that exists in both dataframes

Python3

new_df = df1.join(df2, ["ID"])
new_df.show()

Output:

+---+------+------+------+
| ID|  Name|Gender|salary|
+---+------+------+------+
|  5| Jenis|     F|  1200|
|  1|   Ram|     M|  3000|
|  3|Rohini|     M|  4000|
|  2|  Mike|     M|  4000|
|  4| Maria|     F|  4000|
+---+------+------+------+