
Last modified: 2023-12-03 | Author: Mango

PySpark Case When

PySpark's case-when construct, provided by the when() function, lets programmers perform conditional operations on data using SQL-like syntax. It is commonly used in data analysis and data processing, wherever you need to transform or derive columns based on specific conditions.

Syntax
from pyspark.sql.functions import when, col

df.withColumn('new_col', when(col('col1')==val1, val2).otherwise(val3))

In the above syntax, the withColumn() function adds a new column to the DataFrame. The when() function takes a Boolean expression as its first argument and a value as its second argument. For each row where the Boolean expression evaluates to true, that value is assigned to the new column; for rows where it evaluates to false, the value passed to otherwise() is assigned instead. If otherwise() is omitted, non-matching rows receive null.

Example
from pyspark.sql.functions import when, col

data = [("John", 25), ("Mary", 20), ("Mike", 30)]
df = spark.createDataFrame(data, ["name", "age"])

df = df.withColumn("age_group", when(col("age") < 25, "young").otherwise("old"))

df.show()

In the above example, a DataFrame is created with two columns, "name" and "age". The withColumn() function adds a new column "age_group". Using when(), each row where age is less than 25 is assigned to the "young" category; all other rows, including John's (age exactly 25), are assigned to the "old" category by otherwise().

Conclusion

PySpark's case-when construct is a versatile tool for performing conditional operations on a DataFrame using SQL-like syntax. It expresses in a single column expression what would otherwise require verbose if-else logic applied row by row.