How to sort by value in PySpark?
In this article, we will sort the rows of an RDD by value in PySpark.
Creating an RDD for demonstration:
Python3
# importing module
from pyspark.sql import SparkSession, Row
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create 3 Rows with 3 columns each and collect them in a list
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
# create an RDD from the list of rows
rdd = spark.sparkContext.parallelize(data)
# display data
rdd.collect()
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]
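Method 1: Using sortBy()
sortBy() is an RDD method that sorts the elements of an RDD by the key a function returns for each element. The result is in ascending order by default.
Syntax: rdd.sortBy(lambda expression)
The lambda expression selects the column to sort on. The index is zero-based, so x[0] is the first column.
lambda expression: lambda x: x[column_index]
Example 1: Sorting the data by the values in column 1 (First_name)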
Python3
# sort the data by values based on column 1
rdd.sortBy(lambda x: x[0]).collect()
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
Example 2: Sorting the data by the values in column 3 (age)
Python3
# sort the data by values based on column 3 (age, index 2)
rdd.sortBy(lambda x: x[2]).collect()
Output:
[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
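The function passed to sortBy() is an ordinary key function, the same idea as the key argument of Python's built-in sorted(). The following is a minimal local sketch of that idea, not Spark code: it assumes a namedtuple called Person (a hypothetical stand-in for Row) so that it runs without a Spark session. It also shows two variations the examples above do not cover: using a field name as the key, and sorting in descending order.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
Person = namedtuple("Person", ["First_name", "Last_name", "age"])

people = [
    Person("Sravan", "Kumar", 23),
    Person("Ojaswi", "Pinkey", 16),
    Person("Rohith", "Devi", 7),
]

# Same key as rdd.sortBy(lambda x: x[2]): index 2 is the third field, age.
by_age = sorted(people, key=lambda x: x[2])

# A field name can serve as the key and is easier to read than a position.
by_first_name = sorted(people, key=lambda x: x.First_name)

# Descending order; on an RDD the counterpart is sortBy(..., ascending=False).
by_age_desc = sorted(people, key=lambda x: x.age, reverse=True)

print([p.First_name for p in by_age])
print([p.First_name for p in by_first_name])
print([p.First_name for p in by_age_desc])
```

On the RDD itself, the descending variant would be written as rdd.sortBy(lambda x: x[2], ascending=False).collect().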
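Method 2: Using takeOrdered()
takeOrdered() is an RDD method that returns the first n elements of the RDD as a Python list, in ascending order of the key a function computes for each element.
Syntax: rdd.takeOrdered(n, lambda expression)
where n is the number of rows to return after sorting
Example: Sorting values by a specific column using takeOrdered()
Python3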
# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[0]))
# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[2]))
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
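In the example above n equals the number of rows, so takeOrdered() returns everything. With a smaller n it returns only the n smallest elements by key. The sketch below imitates that behaviour locally with heapq.nsmallest from the standard library. It is an illustration under assumptions, not Spark code: Person is a hypothetical stand-in for Row, and the data lives in a plain list rather than an RDD.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
Person = namedtuple("Person", ["First_name", "Last_name", "age"])

people = [
    Person("Sravan", "Kumar", 23),
    Person("Ojaswi", "Pinkey", 16),
    Person("Rohith", "Devi", 7),
]

# Local counterpart of rdd.takeOrdered(2, lambda x: x[2]):
# the two rows with the smallest age, in ascending order.
youngest_two = heapq.nsmallest(2, people, key=lambda x: x[2])

# Negating a numeric key reverses the order:
# the two rows with the largest age, oldest first.
oldest_two = heapq.nsmallest(2, people, key=lambda x: -x[2])

print([p.First_name for p in youngest_two])
print([p.First_name for p in oldest_two])
```

The same negated key works on an RDD, for example rdd.takeOrdered(2, lambda x: -x[2]). Because takeOrdered() returns its result to the driver as a list, it is best suited to small values of n.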