
Last modified: 2023-12-03 14:48:11.036000 | Author: Mango

Union DataFrames in PySpark

In PySpark, the union operation combines two DataFrames vertically, appending the rows of one DataFrame below the other. This lets you consolidate data from multiple DataFrames into a single DataFrame. Both DataFrames must have the same number of columns.

Syntax

The basic syntax to perform a union operation in PySpark is as follows:

new_df = df1.union(df2)

Here, df1 and df2 are the two DataFrames you want to combine. The resulting DataFrame new_df contains all the rows of df1 followed by all the rows of df2.

Example

Let's consider a simple example to demonstrate the union operation in PySpark:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Create two sample DataFrames
data1 = [("John", 25), ("Alice", 30)]
df1 = spark.createDataFrame(data1, ["Name", "Age"])

data2 = [("Bob", 35), ("Eve", 28)]
df2 = spark.createDataFrame(data2, ["Name", "Age"])

# Perform union operation
new_df = df1.union(df2)

# Display the result
new_df.show()

Output:

+-----+---+
| Name|Age|
+-----+---+
| John| 25|
|Alice| 30|
|  Bob| 35|
|  Eve| 28|
+-----+---+

In the example above, we create two DataFrames, df1 and df2, with sample data. We then combine them with df1.union(df2), assign the result to a new DataFrame new_df, and display its contents with the show() method.

Requirements

To execute the above code, you need to have Apache Spark installed and the pyspark Python package configured in your environment.
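If pyspark is not yet available in your environment, it can typically be installed from PyPI, which bundles a local Spark distribution (a working Java runtime is still required):

```shell
# Install PySpark from PyPI; Java must be available on the machine
pip install pyspark
```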

Conclusion

The union operation in PySpark lets you vertically combine two DataFrames and consolidate data from multiple sources into one. Understanding how it works, including that it matches columns by position and keeps duplicates, is essential for anyone working with multiple DataFrames in PySpark.