Count values by condition in a PySpark DataFrame

Last modified: 2022-05-13 01:54:30.262000             Author: Mango

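In this article, we will count the values in a PySpark DataFrame column that satisfy a given condition.

Creating the dataframe for demonstration: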

Python3
# importing module
import pyspark
 
# importing sparksession from
# pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list  of employee data with 10 row values
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "IT", 30000],
        ["3", "bobby", "business", 45000],
        ["4", "rohith", "IT", 45000],
        ["5", "gnanesh", "business", 120000],
        ["6", "siva nagulu", "sales", 23000],
        ["7", "bhanu", "sales", 34000],
        ["8", "sireesha", "business", 456798],
        ["9", "ravi", "IT", 230000],
        ["10", "devi", "business", 100000],
        ]
 
# specify column names
columns = ['ID', 'NAME', 'sector', 'salary']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
# display dataframe
dataframe.show()


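Output:

+---+-----------+--------+------+
| ID|       NAME|  sector|salary|
+---+-----------+--------+------+
|  1|     sravan|      IT| 45000|
|  2|     ojaswi|      IT| 30000|
|  3|      bobby|business| 45000|
|  4|     rohith|      IT| 45000|
|  5|    gnanesh|business|120000|
|  6|siva nagulu|   sales| 23000|
|  7|      bhanu|   sales| 34000|
|  8|   sireesha|business|456798|
|  9|       ravi|      IT|230000|
| 10|       devi|business|100000|
+---+-----------+--------+------+
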

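Method 1: Using select(), where(), count()

where(): returns a new dataframe containing only the rows that satisfy the given condition. It takes a condition as its argument and returns a dataframe.

count(): returns the number of rows in the dataframe.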

Example 1: Python program to count the values in the NAME column where ID is greater than 5

Python3

# count values in the NAME column
# where ID is greater than 5
dataframe.select('NAME').where(dataframe.ID > 5).count()

Output:

5


Example 2: Python program to count the rows, keeping all columns, where ID is greater than 3 and sector is IT

Python3

# select all columns, then count the rows
# where ID is greater than 3 and sector is IT
dataframe.select('*').where((dataframe.ID > 3) &
                            (dataframe.sector == 'IT')).count()

Output:

2

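Method 2: Using filter(), count()

filter(): returns a new dataframe containing only the rows that satisfy the given condition, dropping the rows that do not. It takes a condition as its argument and returns a dataframe.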

Example 1: Python program to count the values in the ID column where ID equals 4

Python3

# count values in the ID column where ID equals 4
dataframe.select('ID').filter(dataframe.ID == 4).count()

Output:

1

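Example 2: Python program to count the values in the ID column where ID is greater than 4 and sector is sales or IT

Python3

# count values in the ID column where ID is greater than 4
# and sector is sales or IT
dataframe.select('ID').filter((dataframe.ID > 4) &
                              ((dataframe.sector == 'sales') |
                               (dataframe.sector == 'IT'))).count()

Output:

3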