Subset or Filter Data with Multiple Conditions in PySpark
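When working with a large dataframe made up of many rows and columns, we often need to filter it, or to take the subset of it on which an operation should be applied. A single condition is not always enough for this, so we pass several conditions together to filter the dataframe or take a subset of it. In this article, we will learn how to subset or filter a PySpark dataframe based on multiple conditions.
To subset or filter data from a dataframe, we use the filter() function. It filters the rows of the dataframe according to the given condition, which can be a single condition or a combination of several.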
Syntax: df.filter(condition)
where df is the dataframe from which the data is subset or filtered.
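We can pass multiple conditions to the function in two ways:
- As a string expression in double quotes ("condition")
- Using dot notation on the dataframe columns in the condition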
Let's create a dataframe.
Python
# importing necessary libraries
from pyspark.sql import SparkSession


# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk


# function to create a dataframe from the given data and schema
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1


if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70)]
    schema = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
    df.show()
Output:
Now let's apply filters to it:
Example 1: Using the 'and' operator inside a double-quoted (" ") condition string
Python
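# filter the dataframe with multiple conditions:
# keep male students whose percentage is above 70.
# The result gets its own name so that df stays
# unfiltered for the other examples.
df_filtered = df.filter("Gender == 'Male' and Percentage>70")
df_filtered.show()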
Output:
Example 2: Using the 'or' operator inside a double-quoted (" ") condition string
Python
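# filter the dataframe with multiple conditions:
# keep students older than 20 or with a percentage above 80.
# The result gets its own name so that df stays
# unfiltered for the other examples.
df_filtered = df.filter("Age>20 or Percentage>80")
df_filtered.show()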
Output:
Example 3: Using the '&' operator with dot (.) notation
Python
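# filter the dataframe with multiple conditions:
# keep female students aged 18 or older.
# Each condition is wrapped in parentheses before
# being combined with &. The result gets its own
# name so that df stays unfiltered.
df_filtered = df.filter((df.Gender=='Female') & (df.Age>=18))
df_filtered.show()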
Output:
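The parentheses around each condition in the dot-notation form matter because Python's & operator binds more tightly than comparison operators such as == and >=. The sketch below uses only the standard ast module (no Spark session needed) to show how Python parses the condition with and without parentheses. The variable names are assumptions for this illustration.

```python
import ast

# The same condition written without and with parentheses
unparenthesized = "df.Gender == 'Female' & df.Age >= 18"
parenthesized = "(df.Gender == 'Female') & (df.Age >= 18)"

# Parse each expression and look at the top-level node
tree_bad = ast.parse(unparenthesized, mode="eval").body
tree_good = ast.parse(parenthesized, mode="eval").body

# Without parentheses Python reads a chained comparison:
#   df.Gender == ('Female' & df.Age) >= 18
print(type(tree_bad).__name__)   # Compare

# With parentheses the top-level operation is the & that
# combines two separate comparisons
print(type(tree_good).__name__)  # BinOp
```

Without the parentheses, the expression is therefore not the intended "Gender is Female and Age is at least 18" condition, so wrapping each comparison is the safe habit when combining conditions with & or |.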
Example 4: Using the '|' operator with dot (.) notation
Python
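# filter the dataframe with multiple conditions:
# keep students who are male or whose percentage is above 90.
# The result gets its own name so that df stays unfiltered.
df_filtered = df.filter((df.Gender=='Male') | (df.Percentage>90))
df_filtered.show()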
Output: