How to name aggregate columns in a PySpark DataFrame?
In this article, we will see how to name aggregate columns in a PySpark dataframe.
We can do this by calling alias() after groupBy(). groupBy() groups the rows of the dataframe by the values of one column so that an aggregate function can be applied to another column, and alias() sets the name of the new column produced by that aggregation.
Syntax: dataframe.groupBy("column_name1").agg(aggregate_function("column_name2").alias("new_column_name"))
Where,
- dataframe is the input dataframe
- aggregate_function is the function applied to the grouped column, such as sum(), avg(), or count()
- new_column_name is the name of the new aggregate column
- alias is the method used to set the new column name
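As a quick, self-contained sketch of this pattern (the dataframe, app name, and column names below are made up for illustration and are not part of the article's example):
Python3
# minimal sketch: rename an aggregated column with alias()
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.appName('aliasdemo').getOrCreate()
df = spark.createDataFrame([("A", 10), ("B", 20), ("A", 30)],
                           ["dept", "pay"])

# without alias() the new column would be named sum(pay)
df.groupBy("dept").agg(sum("pay").alias("total_pay")).show()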
Create a dataframe for demonstration:
Python3
# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data with 10 row values
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "IT", 30000],
        ["3", "bobby", "business", 45000],
        ["4", "rohith", "IT", 45000],
        ["5", "gnanesh", "business", 120000],
        ["6", "siva nagulu", "sales", 23000],
        ["7", "bhanu", "sales", 34000],
        ["8", "sireesha", "business", 456798],
        ["9", "ravi", "IT", 230000],
        ["10", "devi", "business", 100000]]
# specify column names
columns = ['ID', 'NAME', 'sector', 'salary']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
# display dataframe
dataframe.show()
Output:
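Before aggregating, it can help to confirm the schema that createDataFrame inferred from the Python values; the salary column, for instance, comes out as a long because the inputs are Python ints:
Python3
# print the inferred schema of the demo dataframe
dataframe.printSchema()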
Example 1: Python program to group the salaries across sectors and name the aggregated column Employee_salary using sum aggregation. The sum() function lives in the pyspark.sql.functions package, so we need to import it.
Python3
# importing sum function
from pyspark.sql.functions import sum
# group the salary among different sectors
# and name as Employee_salary by sum aggregation
dataframe.groupBy("sector").agg(
    sum("salary").alias("Employee_salary")).show()
Output:
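Note that importing sum from pyspark.sql.functions shadows Python's built-in sum(). A common way to avoid this (a sketch of the same aggregation, not part of the original example) is to import the module under an alias:
Python3
# import the functions module under an alias so the
# Python built-in sum() is not shadowed
from pyspark.sql import functions as F

dataframe.groupBy("sector").agg(
    F.sum("salary").alias("Employee_salary")).show()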
Example 2: Python program to group the salaries across sectors and name the aggregated column Average_Employee_salary using average aggregation.
Syntax: avg("column_name")
Python3
# importing avg function
from pyspark.sql.functions import avg
# group the salary among different sectors
# and name as Average_Employee_salary
# by average aggregation
dataframe.groupBy("sector")
.agg(avg(
"salary").alias("Average_Employee_salary")).show()
Output:
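agg() also accepts several aggregate expressions at once, so multiple columns can be aliased in a single groupBy pass (a sketch combining the sum and average from the examples above):
Python3
# compute two aliased aggregates in one pass
from pyspark.sql import functions as F

dataframe.groupBy("sector").agg(
    F.sum("salary").alias("Employee_salary"),
    F.avg("salary").alias("Average_Employee_salary")).show()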
Example 3: Group the salaries across sectors and name the aggregated column Total-People using count aggregation.
Python3
# importing count function
from pyspark.sql.functions import count
# group the salary among different
# sectors and name as Total-People
# by count aggregation
dataframe.groupBy("sector")
.agg(count(
"salary").alias("Total-People")).show()
Output:
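When the aggregation comes from a shorthand such as count(), the generated column can also be renamed afterwards with withColumnRenamed(), which gives the same result as the alias() approach (a sketch, not part of the original article):
Python3
# groupBy().count() creates a column literally named "count";
# withColumnRenamed() gives it a friendlier name
dataframe.groupBy("sector").count() \
    .withColumnRenamed("count", "Total-People").show()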