📜  How to Union Multiple DataFrames in PySpark?

📅  Last modified: 2022-05-13 01:54:55.716000             🧑  Author: Mango

In this article, we will discuss how to combine multiple DataFrames in PySpark.

Method 1: The union() function in PySpark

The PySpark union() function is used to combine two or more DataFrames that share the same structure, or schema. It matches columns by position rather than by name, so if the schemas of the DataFrames differ, the function either raises an error or produces an incorrect result.

Example 1:

In this example, both DataFrames, data_frame1 and data_frame2, have the same schema.

Python3
# Python program to illustrate the
# working of union() function
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
  
# Creating a dataframe
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another dataframe
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)
  
# union()
answer = data_frame1.union(data_frame2)
  
# Print the result of the union()
answer.show()


Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
+------------+------------------+
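The example above unions just two DataFrames. Since union() is a binary operation, merging a whole list of DataFrames is usually done by folding union() over the list with functools.reduce. The helper below is a minimal sketch, not part of the original article; the name union_all is illustrative, not a PySpark API:

```python
from functools import reduce

def union_all(frames):
    # Fold the binary union() pairwise over a list of DataFrames;
    # every frame must share the same schema, as with a plain union().
    return reduce(lambda left, right: left.union(right), frames)

# With Spark (assuming the DataFrames from the example above exist):
# combined = union_all([data_frame1, data_frame2])
# combined.show()
```

The same pattern works with unionByName() when only the column order may differ between the frames.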

Example 2:

In this example, we combine two DataFrames, data_frame1 and data_frame2. Note that their schemas differ: the column order is reversed. The output is therefore not the desired one, because union() matches columns by position and should only be applied to DataFrames with the same structure.

Python3

# Python program to illustrate the working
# of union() function
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
  
# Creating a data frame
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another data frame
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"]
)
  
# Union both the dataframes using the union() method
answer = data_frame1.union(data_frame2)
  
# Print the combination of both the dataframes
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      91.123|            Naveen|
|       90.51|            Piyush|
|       87.67|            Hitesh|
+------------+------------------+
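Because union() silently pairs columns by position, a small pre-check on the columns attribute can turn this quiet data corruption into an explicit error. The guard below is a hypothetical sketch (safe_union is not a PySpark API); it relies only on the standard columns attribute and the union() method:

```python
def safe_union(left, right):
    # Refuse to union when the column names (or their order) differ;
    # union() would otherwise pair columns purely by position.
    if left.columns != right.columns:
        raise ValueError(
            f"column mismatch: {left.columns} vs {right.columns}"
        )
    return left.union(right)

# With Spark, the mismatched frames from the example above would
# raise ValueError instead of producing a scrambled result:
# answer = safe_union(data_frame1, data_frame2)
```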

Method 2: The unionByName() function in PySpark

The PySpark unionByName() function is also used to combine two or more DataFrames, but it can combine DataFrames whose schemas differ in column order. This is because it matches columns by name rather than by position.

Example 1:

In this example, both DataFrames, data_frame1 and data_frame2, have the same schema.

Python3

# Python program to illustrate the working
# of unionByName() function
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
  
# Creating a dataframe
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another dataframe
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)
  
# Union both the dataframes using the unionByName() method
answer = data_frame1.unionByName(data_frame2)
  
# Print the result of the union()
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
+------------+------------------+

Example 2:

In this example, data_frame1 and data_frame2 have different schemas (the column order is reversed), yet the output is still the desired one, because unionByName() matches columns by name.

Python3

# Python program to illustrate the
# working of unionByName() function
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
  
# Creating a data frame
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98), ("Harshit", 80.31)],
    ["Student Name", "Overall Percentage"]
)
  
# Creating another data frame
data_frame2 = spark.createDataFrame(
    [(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
    ["Overall Percentage", "Student Name"]
)
  
# Union both the dataframes using unionByName() method
answer = data_frame1.unionByName(data_frame2)
  
# Print the combination of both the dataframes
answer.show()

Output:

+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
|   Bhuwanesh|             82.98|
|     Harshit|             80.31|
|      Naveen|            91.123|
|      Piyush|             90.51|
|      Hitesh|             87.67|
+------------+------------------+

Example 3:

Now let us consider two DataFrames with unequal numbers of columns, i.e. entirely different schemas. In this case, we need to pass the additional argument allowMissingColumns=True to the unionByName() function; columns missing from either DataFrame are filled with null.

Python3

# Python program to illustrate the working
# of unionByName() function with an
# additional argument
  
import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
  
# Creating a dataframe
data_frame1 = spark.createDataFrame(
    [("Bhuwanesh", 82.98, "Computer Science"),
     ("Harshit", 80.31, "Information Technology")],
    ["Student Name", "Overall Percentage", "Department"]
)
  
# Creating another dataframe
data_frame2 = spark.createDataFrame(
    [("Naveen", 91.123), ("Piyush", 90.51)],
    ["Student Name", "Overall Percentage"]
)
  
# Union both the dataframes using unionByName() method
res = data_frame1.unionByName(data_frame2, allowMissingColumns=True)
  
# Print the result of the union()
res.show()

Output:

+------------+------------------+--------------------+
|Student Name|Overall Percentage|          Department|
+------------+------------------+--------------------+
|   Bhuwanesh|             82.98|    Computer Science|
|     Harshit|             80.31|Information Techn...|
|      Naveen|            91.123|                null|
|      Piyush|             90.51|                null|
+------------+------------------+--------------------+
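A final caveat: unlike SQL's UNION, both union() and unionByName() keep duplicate rows (they behave like UNION ALL). If duplicates should be dropped after merging, chain .distinct() onto the result. The helper below sketches this for a list of DataFrames; union_all_distinct is an illustrative name, not a PySpark API:

```python
from functools import reduce

def union_all_distinct(frames):
    # Merge by column name (tolerating missing columns), then drop
    # duplicate rows; union()/unionByName() alone keep duplicates.
    merged = reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        frames,
    )
    return merged.distinct()

# With Spark:
# res = union_all_distinct([data_frame1, data_frame2])
# res.show()
```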