📜  如何在 Pyspark DataFrame 中选择和排序多列?

📅  最后修改于: 2022-05-13 01:55:05.414000             🧑  作者: Mango

如何在 Pyspark DataFrame 中选择和排序多列?

在本文中,我们将讨论如何使用Python的pyspark 从数据框中选择和排序多列。为此,我们使用 sort() 和 orderBy() 函数以及 select()函数。

使用的方法

  • Select():此方法用于选择数据框列的一部分并返回该新选择的数据框的副本。
  • sort():此方法用于对数据帧的数据进行排序并返回新排序的数据帧的副本。默认情况下,这会按升序对数据框进行排序。
  • oderBy():该方法类似于sort,也用于对数据框进行排序。默认情况下,该方法按升序对数据框进行排序。

让我们创建一个示例数据框



Python3
# importing module
import pyspark
  
# importing sparksession from 
# pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "sravan", "vignan"], ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"], ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"], ["5", "gnanesh", "iit"]]
  
# specify column names
columns = ['student ID', 'student NAME', 'college']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
print("Actual data in dataframe")
# show dataframe
dataframe.show()


Python3
# show dataframe by sorting the dataframe
# based on two columns in ascending
# order using sort() function
dataframe.select(['student ID', 'student NAME']
                ).sort(['student ID', 'student NAME'], 
                       ascending=True).show()


Python3
# show dataframe by sorting the dataframe
# based on three columns in desc order
# using sort() function
dataframe.select(['student ID', 'student NAME', 'college']
                ).sort(['student ID', 'student NAME', 'college'],
                       ascending=False).show()


Python3
# show dataframe by sorting the dataframe
# based on three columns in desc
# order using orderBy() function
dataframe.select(['student ID', 'student NAME', 'college']
                ).orderBy(['student ID', 'student NAME', 'college'],
                          ascending=False).show()


Python3
# show dataframe by sorting the dataframe
# based on two columns in asc
# order using orderBy() function
dataframe.select(['student NAME', 'college']
                ).orderBy(['student NAME', 'college'],
                          ascending=True).show()


输出:

使用 sort() 方法选择多列并排序

蟒蛇3

# show dataframe by sorting the dataframe
# based on two columns in ascending
# order using sort() function
dataframe.select(['student ID', 'student NAME']
                ).sort(['student ID', 'student NAME'], 
                       ascending=True).show()

输出:

蟒蛇3



# show dataframe by sorting the dataframe
# based on three columns in desc order
# using sort() function
dataframe.select(['student ID', 'student NAME', 'college']
                ).sort(['student ID', 'student NAME', 'college'],
                       ascending=False).show()

输出:

使用 orderBy() 方法选择多列并排序

蟒蛇3

# show dataframe by sorting the dataframe
# based on three columns in desc
# order using orderBy() function
dataframe.select(['student ID', 'student NAME', 'college']
                ).orderBy(['student ID', 'student NAME', 'college'],
                          ascending=False).show()

输出:

蟒蛇3

# show dataframe by sorting the dataframe
# based on two columns in asc
# order using orderBy() function
dataframe.select(['student NAME', 'college']
                ).orderBy(['student NAME', 'college'],
                          ascending=True).show()

输出: