📜  How to name aggregate columns in a PySpark DataFrame?

📅  Last Modified: 2022-05-13 01:54:42.827000             🧑  Author: Mango


In this article, we will see how to name aggregated columns in a PySpark DataFrame.

We can do this by using alias() after groupBy(). groupBy() groups the rows of the DataFrame by the values of one or more columns so that an aggregate function can be applied to each group, and alias() renames the new column that the aggregation produces.
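In other words, the general pattern is as follows (a minimal sketch; "dataframe" and the column names here are placeholders, not part of the article's example):

Python3

# import the aggregate function to apply, e.g. sum
from pyspark.sql.functions import sum

# group rows by one column, aggregate another, and
# rename the aggregated column with alias()
dataframe.groupBy("group_column").agg(
    sum("value_column").alias("new_column_name")).show()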

Creating a DataFrame for demonstration:



Python3
# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data with 10 row values
data =[["1","sravan","IT",45000],
       ["2","ojaswi","IT",30000],
       ["3","bobby","business",45000],
       ["4","rohith","IT",45000],
       ["5","gnanesh","business",120000],
       ["6","siva nagulu","sales",23000],
       ["7","bhanu","sales",34000],
       ["8","sireesha","business",456798],
       ["9","ravi","IT",230000],
       ["10","devi","business",100000],
       ]
  
# specify column names
columns=['ID','NAME','sector','salary']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
  
# display dataframe
dataframe.show()



Output:

Example 1: Python program to group the salary among the different sectors and name the aggregated column Employee_salary using sum aggregation. The sum() function is available in the pyspark.sql.functions package, so we need to import it.

Python3

# importing sum function
from pyspark.sql.functions import sum
  
# group the salary among different sectors
# and name  as Employee_salary by sum aggregation
dataframe.groupBy("sector").agg(
    sum("salary").alias("Employee_salary")).show()

Output:
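Note: importing sum from pyspark.sql.functions shadows Python's built-in sum(). A common way to avoid this (a sketch of the same aggregation, not part of the original article) is to import the functions module under an alias:

Python3

# importing the module under an alias keeps Python's
# built-in sum() available
from pyspark.sql import functions as F

dataframe.groupBy("sector").agg(
    F.sum("salary").alias("Employee_salary")).show()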

Example 2: Python program to group the salary among the different sectors and name the aggregated column Average_Employee_salary using average aggregation.



Python3

# importing avg function
from pyspark.sql.functions import avg
  
# group the salary among different sectors
# and name  as Average_Employee_salary
# by average aggregation
dataframe.groupBy("sector").agg(
    avg("salary").alias("Average_Employee_salary")).show()

Output:
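alias() works the same way when several aggregations are computed in a single agg() call, with each aggregated column getting its own name (a hypothetical extension of the example above, not part of the original article):

Python3

# each aggregation in one agg() call can carry its own alias
from pyspark.sql.functions import avg, max, min

dataframe.groupBy("sector").agg(
    avg("salary").alias("Average_salary"),
    max("salary").alias("Highest_salary"),
    min("salary").alias("Lowest_salary")).show()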

Example 3: Group the salary among the different sectors and name the aggregated column Total-People using count aggregation.

Python3

# importing count function
from pyspark.sql.functions import count
  
# group the salary among different 
# sectors and name  as Total-People
# by count aggregation
dataframe.groupBy("sector").agg(
    count("salary").alias("Total-People")).show()

Output:
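If it is more convenient to rename after the aggregation, withColumnRenamed() achieves the same result; this sketch (not part of the original article) renames the column produced by the count() shortcut on grouped data:

Python3

# GroupedData.count() yields a column literally named "count";
# withColumnRenamed() renames it afterwards
dataframe.groupBy("sector").count() \
    .withColumnRenamed("count", "Total-People").show()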