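Count Values by Condition in PySpark Dataframe
In this article, we count the values in a column of a PySpark dataframe that satisfy a given condition.
Create a dataframe for demonstration:
Python3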
# importing module
import pyspark
# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data with 10 row values
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "IT", 30000],
        ["3", "bobby", "business", 45000],
        ["4", "rohith", "IT", 45000],
        ["5", "gnanesh", "business", 120000],
        ["6", "siva nagulu", "sales", 23000],
        ["7", "bhanu", "sales", 34000],
        ["8", "sireesha", "business", 456798],
        ["9", "ravi", "IT", 230000],
        ["10", "devi", "business", 100000],
        ]
# specify column names
columns = ['ID', 'NAME', 'sector', 'salary']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# display dataframe
dataframe.show()
Method 1: Using select(), where(), count()
where(): returns a new dataframe containing only the rows of the input dataframe that satisfy the given condition.
Syntax: where(dataframe.column condition)
Where,
- dataframe is the input dataframe
- column is the name of the column the condition is applied to
count(): returns the number of rows in the dataframe.
Syntax: dataframe.count()
Example 1: Python program to count the values in the NAME column where ID is greater than 5
Python3
# count values in NAME column
# where ID greater than 5
dataframe.select('NAME').where(dataframe.ID>5).count()
Output:
5
Example 2: Python program to count the rows where ID is greater than 3 and sector is IT
Python3
# count rows across all columns
# where ID greater than 3 and sector = IT
dataframe.select().where((dataframe.ID > 3) &
                         (dataframe.sector == 'IT')).count()
Output:
2
Method 2: Using filter(), count()
filter(): returns a new dataframe containing only the rows of the input dataframe that satisfy the given condition; rows that do not match are dropped.
Syntax: filter(dataframe.column condition)
Where,
- dataframe is the input dataframe
- column is the name of the column the condition is applied to
Example 1: Python program to count the ID column where ID = 4
Python3
# count ID column where ID = 4
dataframe.select('ID').filter(dataframe.ID == 4).count()
Output:
1
Example 2: Python program to count the ID column where ID > 4 and sector is sales or IT
Python3
# count ID column where ID > 4
# and sector is sales or IT
dataframe.select('ID').filter((dataframe.ID > 4) &
                              ((dataframe.sector == 'sales') |
                               (dataframe.sector == 'IT'))).count()
Output:
3