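Python PySpark: DataFrame filter on multiple columns
In this article, we will filter a DataFrame on multiple columns using the filter() and where() functions in PySpark.
Creating a DataFrame for demonstration: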
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]
# specify column names
columns = ['ID', 'NAME', 'Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# display dataframe
dataframe.show()
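Output:
+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+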
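Method 1: Using the filter() method
filter() returns a new DataFrame that contains only the rows satisfying the given condition. It selects rows, not columns, so the result has the same columns as the input. Here the condition refers to more than one column. filter() takes a single condition and returns a DataFrame.
Syntax:
dataframe.filter(condition)
where condition is a Boolean column expression such as dataframe.ID < 3.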
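Example 1: The condition can be built from relational operators (such as < and ==) and combined with the logical operators & (and), | (or) and ~ (not).
Python3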
# select dataframe where ID less than 3
dataframe.filter(dataframe.ID < 3).show()
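Output:
+---+------+---------+
| ID|  NAME|  Company|
+---+------+---------+
|  1|sravan|company 1|
|  2|ojaswi|company 1|
|  1|sravan|company 1|
+---+------+---------+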
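Example 2: A Python program that filters data based on two columns. In this example, we take the PySpark DataFrame created above and select the rows where ID is less than 3 or NAME is sridevi.
Python3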
# select dataframe where ID less than
# 3 or name is sridevi
dataframe.filter((dataframe.ID < 3) |
(dataframe.NAME == 'sridevi')).show()
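Output:
+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+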
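The parentheses around each comparison in the example above matter. In Python, | and & bind more tightly than < and ==, so leaving the parentheses out changes how the expression is parsed. The sketch below uses only the standard-library ast module (no Spark) to show the difference; the helper name top_level_node is hypothetical and exists only for this illustration.

```python
import ast


def top_level_node(expression: str) -> str:
    """Return the class name of the outermost AST node of a Python expression."""
    return type(ast.parse(expression, mode="eval").body).__name__


unparenthesized = "ID < 3 | NAME == 'sridevi'"
parenthesized = "(ID < 3) | (NAME == 'sridevi')"

# Without parentheses the whole expression is one chained comparison:
# ID < (3 | NAME) == 'sridevi'
print(top_level_node(unparenthesized))  # Compare

# With parentheses the | operator joins two separate comparisons
print(top_level_node(parenthesized))    # BinOp
```

With PySpark columns, the unparenthesized form would normally fail at runtime rather than filter as intended, so it is safest to wrap every comparison in its own parentheses.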
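Example 3: Filtering on multiple columns. Here we select the rows where ID is less than 3, or where NAME is sridevi and Company is company 1.
Python3
# select rows where ID is less than 3,
# or NAME is sridevi and Company is company 1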
dataframe.filter((dataframe.ID < 3) | (
(dataframe.NAME == 'sridevi') &
(dataframe.Company == 'company 1'))).show()
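Output:
+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+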
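On this particular data, the three-column condition selects the same rows as the two-column condition from Example 2, because every sridevi row also has Company equal to company 1. The plain-Python sketch below (a stand-in for the Spark logic, not Spark itself) applies both predicates to the article's rows to make that visible; the function names are hypothetical.

```python
# the same rows used to build the DataFrame in this article
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]


def two_column_condition(row):
    """ID < 3 or NAME == 'sridevi' (Example 2)."""
    id_, name, _company = row
    return id_ < 3 or name == "sridevi"


def three_column_condition(row):
    """ID < 3 or (NAME == 'sridevi' and Company == 'company 1') (Example 3)."""
    id_, name, company = row
    return id_ < 3 or (name == "sridevi" and company == "company 1")


rows_two = [row for row in data if two_column_condition(row)]
rows_three = [row for row in data if three_column_condition(row)]

# only the rohith row (ID 3, company 2) is excluded by either condition
print(rows_two == rows_three)
print(len(rows_three))
```

The two conditions would differ only if a sridevi row with ID 3 or more belonged to a different company.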
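Method 2: Using the where() method
where() is an alias for filter(): it returns a new DataFrame that contains only the rows satisfying the given condition. It takes a single condition and returns a DataFrame.
Syntax:
dataframe.where(condition)
Example 1: A Python program that filters on multiple columns
Python3
# select rows where ID is less than 3,
# or NAME is sridevi and Company is company 1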
dataframe.where((dataframe.ID < 3) | (
(dataframe.NAME == 'sridevi') &
(dataframe.Company == 'company 1'))).show()
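Output:
+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+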