📅  Last modified: 2023-12-03 14:48:11.036000             🧑  Author: Mango
In PySpark, the union operation combines two DataFrames vertically, i.e., by appending the rows of one DataFrame below the other. This lets you consolidate data from multiple DataFrames into a single DataFrame.
The basic syntax to perform a union operation in PySpark is as follows:
new_df = df1.union(df2)
Here, df1 and df2 are the two DataFrames you want to combine. The resulting DataFrame new_df contains all the rows from df1 followed by all the rows from df2. Note that union matches columns by position, not by name, so both DataFrames must have the same number of columns.
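The semantics can be sketched in plain Python, with tuples standing in for rows so no Spark installation is needed: union is positional row concatenation, and it keeps duplicate rows (like SQL UNION ALL, not SQL UNION).

```python
# Plain-Python sketch of DataFrame.union semantics (lists of tuples
# stand in for DataFrames; this is an illustration, not the Spark API).
def union(rows1, rows2):
    # Rows are appended positionally, as-is; duplicates are NOT removed.
    return rows1 + rows2

df1_rows = [("John", 25), ("Alice", 30)]
df2_rows = [("Bob", 35), ("John", 25)]   # ("John", 25) duplicates a row in df1_rows

combined = union(df1_rows, df2_rows)
# combined has 4 rows, including the duplicate ("John", 25).
```

On a real DataFrame, call .distinct() on the result of union if you want duplicate rows removed.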
Let's consider a simple example to demonstrate the union operation in PySpark:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("UnionExample").getOrCreate()
# Create two sample DataFrames
data1 = [("John", 25), ("Alice", 30)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])
data2 = [("Bob", 35), ("Eve", 28)]
df2 = spark.createDataFrame(data2, ["Name", "Age"])
# Perform union operation
new_df = df1.union(df2)
# Display the result
new_df.show()
Output:
+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Alice| 30|
| Bob| 35|
| Eve| 28|
+-----+---+
In the above example, we create two DataFrames, df1 and df2, with some sample data. We then perform a union with df1.union(df2) and assign the result to a new DataFrame, new_df. Finally, we display the contents of new_df using the show() method.
To execute the above code, you need Apache Spark installed and the pyspark Python package configured in your environment.
The union operation in PySpark lets you vertically combine two DataFrames and consolidate data from multiple sources. Understanding how to use it is essential for any PySpark programmer working with multiple DataFrames.
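When you have more than two DataFrames to combine, a common pattern is to fold union over the whole list with functools.reduce, e.g. reduce(lambda a, b: a.union(b), dfs). The shape of that pattern can be sketched with plain lists standing in for DataFrames, so it runs without Spark:

```python
from functools import reduce

# Sketch of unioning many "DataFrames" at once. In real PySpark code the
# lambda would be `lambda a, b: a.union(b)`; here list concatenation
# stands in for DataFrame.union so the pattern runs without Spark.
def union_all(frames):
    return reduce(lambda a, b: a + b, frames)

parts = [[("John", 25)], [("Alice", 30)], [("Bob", 35)]]
combined = union_all(parts)
# combined holds the rows of all three parts, in order.
```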