📜  PySpark – GroupBy and Sort a DataFrame in Descending Order

📅  Last modified: 2022-05-13 01:55:19.827000             🧑  Author: Mango


In this article, we will discuss how to group a PySpark DataFrame and then sort it in descending order.

Methods used

  • groupBy(): The groupBy() function in PySpark is used to collect identical data into groups on a DataFrame while performing aggregate functions on the grouped data.
  • sort(): The sort() function is used to sort one or more columns. By default, it sorts in ascending order.
  • orderBy(): PySpark DataFrame also provides the orderBy() function to sort one or more columns; it, too, sorts in ascending order by default (a minimal sketch combining these calls follows this list).
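
Before the full examples, here is a minimal, self-contained sketch of how these calls combine; the app name and the toy data below are placeholders, not part of the original examples.

Python3
# import the required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, desc

# start a Spark session and build a small DataFrame
# (placeholder data; any string + numeric column pair works)
spark = SparkSession.builder.appName("GroupBySortSketch").getOrCreate()
df = spark.createDataFrame([("A", 10), ("B", 20), ("A", 30)],
                           ["Name", "Marks"])

# group rows that share the same Name and average their Marks
grouped = df.groupBy("Name").agg(avg("Marks").alias("Avg_Marks"))

# three equivalent ways to print the result in descending order
grouped.sort(col("Avg_Marks").desc()).show()            # Column.desc()
grouped.sort(desc("Avg_Marks")).show()                  # desc() function
grouped.orderBy("Avg_Marks", ascending=False).show()    # orderBy flag

# stop spark session
spark.stop()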

Example 1: In this example, we group the DataFrame by name and aggregate the marks. We sort the table with the sort() function, accessing the column through the col() function and applying desc() to it so the rows come out in descending order.

Python3
# import the required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, desc
  
# Start spark session
spark = SparkSession.builder.appName("GeeksForGeeks").getOrCreate()
  
# Define sample data
simpleData = [("Pulkit","trial_1",32),
    ("Ritika","trial_1",42),
    ("Pulkit","trial_2",45),
    ("Ritika","trial_2",50),
    ("Ritika","trial_3",62),
    ("Pulkit","trial_3",55),
    ("Ritika","trial_4",75),
    ("Pulkit","trial_4",70)
  ]
  
# define the schema
schema = ["Name","Number_of_Trials","Marks"]
  
# create a dataframe
df = spark.createDataFrame(data=simpleData, schema = schema)
  
# group by Name, aggregate using the average
# of Marks, and sort that column using the
# col() and desc() functions
df.groupBy("Name") \
  .agg(avg("Marks").alias("Avg_Marks")) \
  .sort(col("Avg_Marks").desc()) \
  .show()
  
# stop spark session
spark.stop()
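
Note that alias("Avg_Marks") renames the aggregated column; without it the column would be named avg(Marks), which is more awkward to reference in the sort.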


Output:
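
The output image from the original article is not reproduced here. From the sample data, the average marks work out to 57.25 for Ritika and 50.5 for Pulkit, so show() should print a table roughly like:

+------+---------+
|  Name|Avg_Marks|
+------+---------+
|Ritika|    57.25|
|Pulkit|     50.5|
+------+---------+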

Example 2: In this example, we group the DataFrame by name and aggregate the marks. We sort the table with the sort() function, passing the column name to the desc() function so the rows come out in descending order.

Python3



# import the required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, desc
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# sample dataset
simpleData = [("Pulkit","trial_1",32),
    ("Ritika","trial_1",42),
    ("Pulkit","trial_2",45),
    ("Ritika","trial_2",50),
    ("Ritika","trial_3",62),
    ("Pulkit","trial_3",55),
    ("Ritika","trial_4",75),
    ("Pulkit","trial_4",70)
  ]
  
# define the schema to be used
schema = ["Name","Number_of_Trials","Marks"]
  
# create the dataframe
df = spark.createDataFrame(data=simpleData, schema = schema)
  
# perform the groupBy operation on the Name column,
# aggregate the marks under a new name,
# and sort in descending order by Avg_Marks
df.groupBy("Name") \
  .agg(avg("Marks").alias("Avg_Marks")) \
  .sort(desc("Avg_Marks")) \
  .show()
  
# stop spark session
spark.stop()

Output:
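
The printed table is the same as in Example 1: desc("Avg_Marks") is simply the functional form of col("Avg_Marks").desc(), and the grouping is unchanged.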

Example 3: In this example, we group the DataFrame by name and aggregate the marks. We sort the table with the orderBy() function, passing the ascending parameter as False to sort the data in descending order.

Python3

# import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, desc
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# sample dataset
simpleData = [("Pulkit","trial_1",32),
    ("Ritika","trial_1",42),
    ("Pulkit","trial_2",45),
    ("Ritika","trial_2",50),
    ("Ritika","trial_3",62),
    ("Pulkit","trial_3",55),
    ("Ritika","trial_4",75),
    ("Pulkit","trial_4",70)
  ]
  
# define the schema
schema = ["Name","Number_of_Trials","Marks"]
  
# create a dataframe
df = spark.createDataFrame(data=simpleData, schema = schema)
  
# group by Name, average the Marks, and order the
# result by the new column in descending order
df.groupBy("Name")\
    .agg(avg("Marks").alias("Avg_Marks"))\
    .orderBy("Avg_Marks", ascending=False)\
    .show()
  
# stop spark session
spark.stop()

Output:
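
Again, the result matches Examples 1 and 2: orderBy() is an alias for sort() in PySpark, so orderBy("Avg_Marks", ascending=False) produces the same descending order.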