How to name aggregate columns in a PySpark DataFrame?
In this article, we will see how to name aggregate columns in a PySpark dataframe.
We can do this by calling alias() after groupBy(). groupBy() groups the rows of the dataframe by the values of one column so that an aggregate function can be applied to another column, and alias() sets the name of the new column produced by that aggregation.
Syntax: dataframe.groupBy("column_name1").agg(aggregate_function("column_name2").alias("new_column_name"))
Where,
- dataframe is the input dataframe
- aggregate_function is the function applied to the grouped column, such as sum(), avg(), or count()
- new_column_name is the name of the new aggregate column
- alias is the method used to set the new column name
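As a quick, self-contained sketch of this pattern (the dataframe, app name, and column names below are made up for illustration and are not part of the article's example):
Python3
# minimal sketch: rename an aggregated column with alias()
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName('aliasdemo').getOrCreate()
df = spark.createDataFrame([("A", 10), ("B", 20), ("A", 30)],
                           ["dept", "pay"])

# without alias() the new column would be named sum(pay)
df.groupBy("dept").agg(sum("pay").alias("total_pay")).show()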
Create a dataframe for demonstration:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data with 10 row values
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "IT", 30000],
        ["3", "bobby", "business", 45000],
        ["4", "rohith", "IT", 45000],
        ["5", "gnanesh", "business", 120000],
        ["6", "siva nagulu", "sales", 23000],
        ["7", "bhanu", "sales", 34000],
        ["8", "sireesha", "business", 456798],
        ["9", "ravi", "IT", 230000],
        ["10", "devi", "business", 100000]]
# specify column names
columns = ['ID', 'NAME', 'sector', 'salary']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
# display dataframe
dataframe.show()
Output:
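Before aggregating, it can help to confirm the schema that createDataFrame inferred from the Python values; the salary column, for instance, comes out as a long because the inputs are Python ints:
Python3
# print the inferred schema of the demo dataframe
dataframe.printSchema()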
Example 1: Python program to group the salaries across sectors and name the aggregated column Employee_salary using sum aggregation. The sum() function lives in the pyspark.sql.functions package, so we need to import it.
Python3
# importing sum function
from pyspark.sql.functions import sum
# group the salary among different sectors
# and name as Employee_salary by sum aggregation
dataframe.groupBy("sector").agg(
    sum("salary").alias("Employee_salary")).show()
Output:
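Note that importing sum from pyspark.sql.functions shadows Python's built-in sum(). A common way to avoid this (a sketch of the same aggregation, not part of the original example) is to import the module under an alias:
Python3
# import the functions module under an alias so the
# Python built-in sum() is not shadowed
from pyspark.sql import functions as F

dataframe.groupBy("sector").agg(
    F.sum("salary").alias("Employee_salary")).show()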
Example 2: Python program to group the salaries across sectors and name the aggregated column Average_Employee_salary using average aggregation.
Syntax: avg("column_name")
Python3
# importing avg function
from pyspark.sql.functions import avg
# group the salary among different sectors
# and name as Average_Employee_salary
# by average aggregation
dataframe.groupBy("sector")
.agg(avg(
"salary").alias("Average_Employee_salary")).show()
Output:
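agg() also accepts several aggregate expressions at once, so multiple columns can be aliased in a single groupBy pass (a sketch combining the sum and average from the examples above):
Python3
# compute two aliased aggregates in one pass
from pyspark.sql import functions as F

dataframe.groupBy("sector").agg(
    F.sum("salary").alias("Employee_salary"),
    F.avg("salary").alias("Average_Employee_salary")).show()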
Example 3: Group the salaries across sectors and name the aggregated column Total-People using count aggregation.
Python3
# importing count function
from pyspark.sql.functions import count
# group the salary among different
# sectors and name as Total-People
# by count aggregation
dataframe.groupBy("sector")
.agg(count(
"salary").alias("Total-People")).show()
Output:
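When the aggregation comes from a shorthand such as count(), the generated column can also be renamed afterwards with withColumnRenamed(), which gives the same result as the alias() approach (a sketch, not part of the original article):
Python3
# groupBy().count() creates a column literally named "count";
# withColumnRenamed() gives it a friendlier name
dataframe.groupBy("sector").count() \
    .withColumnRenamed("count", "Total-People").show()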