PySpark Case When is a powerful construct that enables programmers to perform conditional operations on data using SQL-like syntax. It is commonly used in data analysis and data processing, where you need to transform or manipulate data based on specific conditions.
from pyspark.sql.functions import when, col

# val1, val2, and val3 are placeholders for your own values
df.withColumn('new_col', when(col('col1') == val1, val2).otherwise(val3))
In the above syntax, the withColumn() function adds a new column to the DataFrame. The when() function takes a Boolean expression as its first argument and a value as its second. If the Boolean expression evaluates to true, that value is assigned to the new column; otherwise, the value passed to otherwise() is assigned instead.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
data = [("John", 25), ("Mary", 20), ("Mike", 30)]
df = spark.createDataFrame(data, ["name", "age"])

# Label each row "young" if age < 25, otherwise "old"
df = df.withColumn("age_group", when(col("age") < 25, "young").otherwise("old"))
df.show()
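Running this example prints a table like the following:

+----+---+---------+
|name|age|age_group|
+----+---+---------+
|John| 25|      old|
|Mary| 20|    young|
|Mike| 30|      old|
+----+---+---------+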
In the above example, a DataFrame is created with two columns, "name" and "age". The withColumn() function adds a new column "age_group": using when(), each row whose age is less than 25 is assigned to the "young" category, and every other row is assigned to the "old" category.
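when() calls can also be chained to express more than two branches, much like a SQL CASE expression with several WHEN clauses. A minimal sketch, reusing the df above (the cutoff of 60 and the category labels are illustrative):

from pyspark.sql.functions import when, col

# Chained when(): conditions are evaluated top to bottom, first match wins
df = df.withColumn(
    "age_group",
    when(col("age") < 25, "young")
    .when(col("age") < 60, "middle-aged")
    .otherwise("senior"),
)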
PySpark Case When is a versatile construct that enables programmers to perform conditional operations on a DataFrame using SQL-like syntax, saving much of the time and effort required to write complex if-else logic by hand.
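For readers who prefer writing the condition as literal SQL, the same logic can be expressed with expr() and a CASE WHEN string. A minimal sketch, assuming the df from the earlier example:

from pyspark.sql.functions import expr

# The same CASE logic written as a Spark SQL expression string
df = df.withColumn(
    "age_group",
    expr("CASE WHEN age < 25 THEN 'young' ELSE 'old' END"),
)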