PySpark – orderBy() and sort()
In this article, we will see how to sort a data frame by specified columns in PySpark. Both orderBy() and sort() can be used to sort a PySpark data frame.
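orderBy() method:
The orderBy() function sorts a data frame by one or more specified columns.
Syntax: DataFrame.orderBy(*cols, **kwargs)
Parameters :
- cols: Column names or Column objects to order by, passed individually or as a list
- ascending (keyword argument): Specifies the sorting order of the columns listed in cols; either a single boolean or a list of booleans with one entry per column (default True, i.e. ascending)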
Return type: Returns a new DataFrame sorted by the specified columns.
Creating the data frame: create a new SparkSession object named spark, then build a data frame from custom data.
Python3
# Importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
# Create a spark session
spark = SparkSession.builder.appName(
'pyspark - example join').getOrCreate()
# Define data in a dataframe
dataframe = [
("Sam", "Software Engineer", "IND", 10000),
("Raj", "Data Scientist", "US", 41000),
("Jonas", "Sales Person", "UK", 230000),
("Peter", "CTO", "Ireland", 50000),
("Hola", "Data Analyst", "Australia", 111000),
("Ram", "CEO", "Iran", 300000),
("Lekhana", "Advertising", "UK", 250000),
("Thanos", "Marketing", "UIND", 114000),
("Nick", "Data Engineer", "Ireland", 680000),
("Wade", "Data Engineer", "IND", 70000)
]
# Column names of dataframe
columns = ["Name", "Job", "Country", "Salary"]
# Create the spark dataframe
df = spark.createDataFrame(data=dataframe, schema=columns)
# Printing the dataframe
df.show()
Output:
Example 1: Sorting the data frame by a single column
Sort the data frame in ascending order of the employees' "Salary".
Python3
# Order the data by ascending order
# of Salary
df.orderBy(['Salary'], ascending = [True]).show()
# or
# df.orderBy(f.col("Salary").asc()).show()
# or
# df.orderBy(['Salary']).show()
Output:
Example 2: Sorting the data frame in descending order.
Python3
# Order the data by descending order
# of Salary
df.orderBy(['Salary'], ascending = [False]).show()
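Output:
Example 3: Sorting the data frame by multiple columns
Sort the data frame by "Job" in descending order and by "Salary" in ascending order. When two rows share the same "Job", the tie is broken by listing them in ascending order of "Salary".
Python3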
# Sort the dataframe by descending order
# of 'Job' and whenever there is conflict
# in 'Job', it'll be resolved by ordering
# based on ascending order of 'Salary'
df.orderBy(f.col("Job").desc(),f.col("Salary").asc()).show()
# or
# df.orderBy(["Job", "Salary"],ascending = [False, True]).show()
Output:
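To make the tie-breaking rule concrete, the same ordering can be cross-checked in plain Python without Spark. This is only an illustrative sketch: it reuses the sample rows from the data frame above, and the names rows, by_salary, and ordered are hypothetical helpers, not part of PySpark. It relies on Python's sort being stable, so sorting by the secondary key first and the primary key second reproduces "Job descending, then Salary ascending".

```python
# Plain-Python cross-check of "Job descending, then Salary ascending".
# The rows mirror the sample data used for the Spark data frame.
rows = [
    ("Sam", "Software Engineer", "IND", 10000),
    ("Raj", "Data Scientist", "US", 41000),
    ("Jonas", "Sales Person", "UK", 230000),
    ("Peter", "CTO", "Ireland", 50000),
    ("Hola", "Data Analyst", "Australia", 111000),
    ("Ram", "CEO", "Iran", 300000),
    ("Lekhana", "Advertising", "UK", 250000),
    ("Thanos", "Marketing", "UIND", 114000),
    ("Nick", "Data Engineer", "Ireland", 680000),
    ("Wade", "Data Engineer", "IND", 70000),
]

# Python's sort is stable: sort by the secondary key (Salary, ascending) first,
# then by the primary key (Job, descending). Rows with equal Job keep their
# Salary order from the first pass.
by_salary = sorted(rows, key=lambda r: r[3])
ordered = sorted(by_salary, key=lambda r: r[1], reverse=True)

for name, job, country, salary in ordered:
    print(f"{name:<8} {job:<18} {country:<10} {salary}")
```

The only tie in this data is "Data Engineer", where Wade (70000) is listed before Nick (680000).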
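sort() method:
sort() sorts the data frame by the specified columns; a boolean ascending argument selects ascending or descending order. orderBy() is an alias for sort(), so both accept the same arguments.
Syntax:
DataFrame.sort(*cols, **kwargs)
Parameters:
cols: list of Column or column names to sort by
ascending (keyword argument): boolean or list of booleans (default True); pass False to sort in descending order, or a list with one entry per column
Null placement is not a parameter of sort(); it is set per column with Column methods such as asc_nulls_last().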
Example 1: Sort the data frame in ascending order of the employees' "Name".
Python3
# Sort the dataframe by ascending
# order of 'Name'
df.sort(["Name"],ascending = [True]).show()
Output:
Example 2: Sorting a column in descending order.
Python3
# Sort the dataframe by descending order of 'Name'
df.sort(["Name"],ascending = [False]).show()
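Output:
Example 3: Sorting by multiple columns in ascending order.
Python3
# Sort the dataframe by ascending order of 'Name', then 'Salary';
# the ascending list has one flag per column
df.sort(["Name", "Salary"], ascending = [True, True]).show()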
Output: