How to Combine Multiple DataFrames in PySpark?
In this article, we will discuss how to combine multiple DataFrames in PySpark.
Method 1: The union() function in PySpark
The PySpark union() function combines two or more DataFrames and matches columns by position, so it should only be used on DataFrames with the same structure (schema). If the DataFrames have a different number of columns, union() raises an error; if the columns are merely in a different order, it silently produces a garbled result (see Example 2 below).
Syntax: data_frame1.union(data_frame2)
Where,
- data_frame1 and data_frame2 are the dataframes
Example 1:
In this example, the two DataFrames data_frame1 and data_frame2 have the same schema, so union() combines them as expected.
Python3
# Python program to illustrate the
# working of union() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a dataframe
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another dataframe
data_frame2 = spark.createDataFrame(
[("Naveen", 91.123), ("Piyush", 90.51)],
["Student Name", "Overall Percentage"]
)
# union()
answer = data_frame1.union(data_frame2)
# Print the result of the union()
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
+------------+------------------+
Example 2:
In this example, we combine two DataFrames, data_frame1 and data_frame2, whose schemas differ: the columns appear in a different order. The output is therefore not the desired one, because union() matches columns by position and should only be applied to datasets with the same structure.
Python3
# Python program to illustrate the working
# of union() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a data frame
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another data frame
data_frame2 = spark.createDataFrame(
[(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
["Overall Percentage", "Student Name"]
)
# Union both the dataframes using the union() method
answer = data_frame1.union(data_frame2)
# Print the combination of both the dataframes
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| 91.123| Naveen|
| 90.51| Piyush|
| 87.67| Hitesh|
+------------+------------------+
Method 2: The unionByName() function in PySpark
The PySpark unionByName() function also combines two or more DataFrames, but it can combine DataFrames whose schemas differ, because it matches columns by name rather than by position.
Syntax: data_frame1.unionByName(data_frame2)
Where,
- data_frame1 and data_frame2 are the dataframes
Example 1:
In this example, the two DataFrames data_frame1 and data_frame2 have the same schema.
Python3
# Python program to illustrate the working
# of unionByName() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a dataframe
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another dataframe
data_frame2 = spark.createDataFrame(
[("Naveen", 91.123), ("Piyush", 90.51)],
["Student Name", "Overall Percentage"]
)
# Union both the dataframes using the unionByName() method
answer = data_frame1.unionByName(data_frame2)
# Print the result of the union()
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
+------------+------------------+
Example 2:
In this example, data_frame1 and data_frame2 have different schemas (the columns are in a different order), yet the output is the desired one, because unionByName() matches columns by name.
Python3
# Python program to illustrate the
# working of unionByName() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a data frame
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another data frame
data_frame2 = spark.createDataFrame(
[(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
["Overall Percentage", "Student Name"]
)
# Union both the dataframes using unionByName() method
answer = data_frame1.unionByName(data_frame2)
# Print the combination of both the dataframes
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
| Hitesh| 87.67|
+------------+------------------+
Example 3:
Now consider two DataFrames with an unequal number of columns, i.e. entirely different schemas. In this case we need to pass the additional argument allowMissingColumns=True to unionByName(). Columns missing from either DataFrame are filled with null in the result.
Python3
# Python program to illustrate the working
# of unionByName() function with an
# additional argument
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a dataframe
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98, "Computer Science"),
("Harshit", 80.31, "Information Technology")],
["Student Name", "Overall Percentage", "Department"]
)
# Creating another dataframe
data_frame2 = spark.createDataFrame(
[("Naveen", 91.123), ("Piyush", 90.51)],
["Student Name", "Overall Percentage"]
)
# Union both the dataframes using unionByName() method
res = data_frame1.unionByName(data_frame2, allowMissingColumns=True)
# Print the result of the union()
res.show()
Output:
+------------+------------------+--------------------+
|Student Name|Overall Percentage| Department|
+------------+------------------+--------------------+
| Bhuwanesh| 82.98| Computer Science|
| Harshit| 80.31|Information Techn...|
| Naveen| 91.123| null|
| Piyush| 90.51| null|
+------------+------------------+--------------------+