How to Combine Multiple DataFrames in PySpark?
In this article, we will discuss how to combine multiple DataFrames in PySpark.
Method 1: The union() function in PySpark
The PySpark union() function combines two or more DataFrames and matches columns by position, so it should only be used on DataFrames with the same structure (schema). If the DataFrames have a different number of columns, union() raises an error; if the columns are merely in a different order, it silently produces a garbled result (see Example 2 below).
Syntax: data_frame1.union(data_frame2)
Where,
- data_frame1 and data_frame2 are the dataframes
Example 1:
In this example, the two DataFrames data_frame1 and data_frame2 have the same schema, so union() combines them as expected.
Python3
# Python program to illustrate the
# working of union() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a dataframe
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another dataframe
data_frame2 = spark.createDataFrame(
[("Naveen", 91.123), ("Piyush", 90.51)],
["Student Name", "Overall Percentage"]
)
# union()
answer = data_frame1.union(data_frame2)
# Print the result of the union()
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
+------------+------------------+
Example 2:
In this example, we combine two DataFrames, data_frame1 and data_frame2, whose schemas differ: the columns appear in a different order. The output is therefore not the desired one, because union() matches columns by position and should only be applied to datasets with the same structure.
Python3
# Python program to illustrate the working
# of union() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a data frame
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another data frame
data_frame2 = spark.createDataFrame(
[(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
["Overall Percentage", "Student Name"]
)
# Union both the dataframes using the union() method
answer = data_frame1.union(data_frame2)
# Print the combination of both the dataframes
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| 91.123| Naveen|
| 90.51| Piyush|
| 87.67| Hitesh|
+------------+------------------+
Method 2: The unionByName() function in PySpark
The PySpark unionByName() function also combines two or more DataFrames, but it can combine DataFrames whose schemas differ, because it matches columns by name rather than by position.
Syntax: data_frame1.unionByName(data_frame2)
Where,
- data_frame1 and data_frame2 are the dataframes
Example 1:
In this example, the two DataFrames data_frame1 and data_frame2 have the same schema.
Python3
# Python program to illustrate the working
# of unionByName() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a dataframe
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another dataframe
data_frame2 = spark.createDataFrame(
[("Naveen", 91.123), ("Piyush", 90.51)],
["Student Name", "Overall Percentage"]
)
# Union both the dataframes using the unionByName() method
answer = data_frame1.unionByName(data_frame2)
# Print the result of the union()
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
+------------+------------------+
Example 2:
In this example, data_frame1 and data_frame2 have different schemas (the columns are in a different order), yet the output is the desired one, because unionByName() matches columns by name.
Python3
# Python program to illustrate the
# working of unionByName() function
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a data frame
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98), ("Harshit", 80.31)],
["Student Name", "Overall Percentage"]
)
# Creating another data frame
data_frame2 = spark.createDataFrame(
[(91.123, "Naveen"), (90.51, "Piyush"), (87.67, "Hitesh")],
["Overall Percentage", "Student Name"]
)
# Union both the dataframes using unionByName() method
answer = data_frame1.unionByName(data_frame2)
# Print the combination of both the dataframes
answer.show()
Output:
+------------+------------------+
|Student Name|Overall Percentage|
+------------+------------------+
| Bhuwanesh| 82.98|
| Harshit| 80.31|
| Naveen| 91.123|
| Piyush| 90.51|
| Hitesh| 87.67|
+------------+------------------+
Example 3:
Now consider two DataFrames with an unequal number of columns, i.e. entirely different schemas. In this case we need to pass the additional argument allowMissingColumns=True to unionByName(). Columns missing from either DataFrame are filled with null in the result.
Python3
# Python program to illustrate the working
# of unionByName() function with an
# additional argument
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('GeeksforGeeks.com').getOrCreate()
# Creating a dataframe
data_frame1 = spark.createDataFrame(
[("Bhuwanesh", 82.98, "Computer Science"),
("Harshit", 80.31, "Information Technology")],
["Student Name", "Overall Percentage", "Department"]
)
# Creating another dataframe
data_frame2 = spark.createDataFrame(
[("Naveen", 91.123), ("Piyush", 90.51)],
["Student Name", "Overall Percentage"]
)
# Union both the dataframes using unionByName() method
res = data_frame1.unionByName(data_frame2, allowMissingColumns=True)
# Print the result of the union()
res.show()
Output:
+------------+------------------+--------------------+
|Student Name|Overall Percentage| Department|
+------------+------------------+--------------------+
| Bhuwanesh| 82.98| Computer Science|
| Harshit| 80.31|Information Techn...|
| Naveen| 91.123| null|
| Piyush| 90.51| null|
+------------+------------------+--------------------+