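How to flatten a hierarchical index in Pandas DataFrame columns?
In this article, we will see how to flatten a hierarchical index in a Pandas DataFrame. A hierarchical index is usually the result of a groupby aggregation: the grouping keys become the row index, and when several aggregation functions are applied, their names appear as an extra level in the column index of the resulting dataframe.
Method 1: Using the reset_index() function
Pandas provides a function called reset_index() to flatten the hierarchical index created by a groupby aggregation. It moves the index levels back into ordinary columns and replaces the index with the default integer index.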
Syntax: pandas.DataFrame.reset_index(level, drop, inplace)
Parameters:
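- level – removes only the specified levels from the index (all levels by default)
- drop – if True, discards the index instead of inserting it as a column; the index is reset to the default integer index in either case
- inplace – if True, modifies the dataframe object in place instead of returning a new one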
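As a minimal sketch of how the level and drop parameters behave, the snippet below groups by two keys so that the row index has two levels. The sales frame and its year column are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical sample data with two grouping keys (not from the article)
sales = pd.DataFrame({
    "cars": ["bmw", "bmw", "benz", "benz"],
    "year": [2020, 2021, 2020, 2021],
    "units": [20, 22, 24, 26],
})

# Grouping by two keys yields a two-level row index (cars, year)
totals = sales.groupby(["cars", "year"]).sum()

# level: move only the "year" level into a column; "cars" stays as the index
year_as_column = totals.reset_index(level="year")

# drop: discard the index instead of turning it into columns
index_dropped = totals.reset_index(drop=True)

# default call: every index level becomes an ordinary column
fully_flat = totals.reset_index()

print(year_as_column)
print(index_dropped)
print(fully_flat)
```

With drop=True the group labels are lost, so it is only useful when the labels are not needed afterwards.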
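Example:
In this example, we use the pandas groupby function to group car sales data by car and sum the sales of each quarter, then use the pandas reset_index() function to flatten the index of the grouped dataframe.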
Python3
# import the python pandas package
import pandas as pd

# create a sample dataframe
data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]},
                    columns=["cars", "sale_q1 in Cr",
                             "sale_q2 in Cr"])

# group by cars and sum the
# sales of quarters 1 and 2
grouped_data = data.groupby(by="cars").agg("sum")
print(grouped_data)

# use reset_index to flatten the
# index of the grouped dataframe
flat_data = grouped_data.reset_index()
print(flat_data)
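Method 2: Using the as_index parameter
The pandas groupby function takes a boolean parameter called as_index. When as_index is True (the default), the group labels become the index of the aggregated result. When as_index is False, the group labels stay as ordinary columns, so the resulting dataframe is already flat.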
Syntax: pandas.DataFrame.groupby(by, level, axis, as_index)
Parameters:
- by – specifies the columns on which the groupby operation has to be performed
- level – specifies the index level to group by when the index is a MultiIndex
- axis – specifies whether to split along rows (0) or columns (1)
- as_index – if True (default), returns the aggregated output with the group labels as the index; if False, the group labels are returned as columns
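Example:
In this example, we use the pandas groupby function to group car sales data by car and sum the sales of each quarter, and set the as_index parameter to False so that the grouped dataframe comes back flat.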
Python3
# import the python pandas package
import pandas as pd

# create a sample dataframe
data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]},
                    columns=["cars", "sale_q1 in Cr",
                             "sale_q2 in Cr"])

# group by cars and sum the sales
# of quarters 1 and 2, passing
# as_index=False to keep cars as a column
grouped_data = data.groupby(by="cars", as_index=False).agg("sum")

# display
print(grouped_data)
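As a quick check, the sketch below compares this result with the reset_index() approach from Method 1 on the same sample data. The variable names via_parameter and via_reset are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]})

# Method 2: keep the group labels as a column from the start
via_parameter = data.groupby(by="cars", as_index=False).agg("sum")

# Method 1: aggregate with cars as the index, then move it back to a column
via_reset = data.groupby(by="cars").agg("sum").reset_index()

# Both frames should hold the same columns and values
print(list(via_parameter.columns))
print(via_parameter.equals(via_reset))
```

For a single aggregation function the two approaches should be interchangeable, so the choice is mostly a matter of style.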
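Method 3: Flattening the hierarchical index in a pandas dataframe using groupby
Whenever we apply the groupby function with more than one aggregation function on a single column, the result has hierarchical column labels: the original column name on the first level and the aggregation name on the second. In that case, both levels have to be flattened into a single level.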
Syntax: pandas.DataFrame.groupby(by=None, axis=0, level=None)
Explanation:
- by – mapping function that determines the groups in groupby function
- axis – 0 – splits along rows and 1 – splits along columns.
- level – if the axis is multi-indexed, groups at a specified level. (int)
Syntax: pandas.DataFrame.agg(func=None, axis=0)
Explanation:
- func – specifies the function to be used as aggregation function. (min, max, sum etc)
- axis – 0 – function applied to each column and 1- applied to each row.
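To make the two-level column index concrete, the following sketch aggregates the article's sample data with several functions and inspects the resulting column labels. Passing the aggregation names as strings is a choice made for this sketch.

```python
import pandas as pd

data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]})

# Several aggregations per column produce two-level column labels
grouped = data.groupby("cars").agg({"sale_q1 in Cr": ["sum", "max"],
                                    "sale_q2 in Cr": ["sum", "min"]})

# Number of levels in the column index
print(grouped.columns.nlevels)

# Each column label is a (column name, aggregation name) tuple
print(grouped.columns.tolist())
```

Every flattening technique for this case comes down to turning each of these tuples into a single string.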
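Approach:
- Import the Python pandas package.
- Create a sample dataframe showing the car sales for two quarters, q1 and q2.
- Use the pandas groupby function to group by car and compute the sum and max of the quarter 1 sales and the sum and min of the quarter 2 sales.
- The grouped dataframe has multi-indexed columns stored as a list of tuples. Use a for loop to iterate through the list of tuples and join each tuple into a single string.
- Append the joined strings to the flat_cols list.
- Assign the flat_cols list as the column names of the multi-indexed grouped dataframe.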
Python3
# import the python pandas package
import pandas as pd

# create a sample dataframe
data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]},
                    columns=["cars", "sale_q1 in Cr",
                             "sale_q2 in Cr"])

# group by cars and compute the sum and max of quarter 1 sales
# and the sum and min of quarter 2 sales; the aggregations are
# named as strings, since recent pandas versions warn when the
# builtins sum, max and min are passed instead
grouped_data = data.groupby(by="cars").agg({"sale_q1 in Cr": ["sum", "max"],
                                            "sale_q2 in Cr": ["sum", "min"]})

# create an empty list to save the
# names of the flattened columns
flat_cols = []

# the two-level multiindex columns
# are stored as tuples; iterate through
# these tuples and join each into a single string
for i in grouped_data.columns:
    flat_cols.append(i[0] + '_' + i[1])

# now assign the list of flattened
# column names to the grouped dataframe
grouped_data.columns = flat_cols

# print the grouped data
print(grouped_data)
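As an alternative that is not part of the approach above, pandas also supports named aggregation, where each output column gets its name up front and no hierarchical column index is created. The output names q1_sum, q1_max, q2_sum and q2_min in this sketch are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]})

# Named aggregation: output_name=(input column, aggregation name).
# The output names are made up for this example.
named = data.groupby("cars").agg(
    q1_sum=("sale_q1 in Cr", "sum"),
    q1_max=("sale_q1 in Cr", "max"),
    q2_sum=("sale_q2 in Cr", "sum"),
    q2_min=("sale_q2 in Cr", "min"),
)

# The columns are already single-level, so nothing needs flattening
print(named.columns.nlevels)
print(named)
```

This trades the automatic column-name and aggregation-name labels for explicit ones, which can be easier to read when only a handful of aggregates is needed.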
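Method 4: Flattening the hierarchical index using the to_records() function
In this example, we use the to_records() function of the pandas dataframe, which converts the dataframe into a NumPy record array with one tuple-like record per row. This record array is then passed to the pandas.DataFrame function, which turns the hierarchical index into flat columns.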
Syntax: pandas.DataFrame.to_records(index=True, column_dtypes=None)
Explanation:
- index – if True, includes the dataframe index as a field in the resulting record array
- column_dtypes – sets the columns to the specified data type
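The effect of the index parameter can be seen by looking at the field names of the record array. The sketch below uses a single aggregation on the article's sample data. The names with_index and without_index are illustrative.

```python
import pandas as pd

data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]})

grouped = data.groupby("cars").agg("sum")

# index=True (the default): the cars index becomes the first field
with_index = grouped.to_records()

# index=False: only the data columns are kept
without_index = grouped.to_records(index=False)

print(with_index.dtype.names)
print(without_index.dtype.names)
```

For flattening a grouped frame the default is usually what is wanted, because the group labels would otherwise be lost.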
Code:
Python3
# import the python pandas package
import pandas as pd

# create a sample dataframe
data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]},
                    columns=["cars", "sale_q1 in Cr",
                             "sale_q2 in Cr"])

# group by cars and compute the sum
# and max of quarter 1 sales and the
# sum and min of quarter 2 sales
# (aggregations named as strings)
grouped_data = data.groupby(by="cars").agg({"sale_q1 in Cr": ["sum", "max"],
                                            "sale_q2 in Cr": ["sum", "min"]})

# use the to_records function on the grouped data
# and pass the result to the DataFrame function
flattened_data = pd.DataFrame(grouped_data.to_records())
print(flattened_data)
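One thing to be aware of with this method is that the two-level column labels end up as the string form of each tuple. The sketch below inspects those names and, as an optional cleanup step that is an addition of this sketch, replaces them with joined names.

```python
import pandas as pd

data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]})

grouped = data.groupby("cars").agg({"sale_q1 in Cr": ["sum", "max"],
                                    "sale_q2 in Cr": ["sum", "min"]})

flattened = pd.DataFrame(grouped.to_records())

# Column names are now plain strings that look like tuples
raw_names = list(flattened.columns)
print(raw_names)

# Optional cleanup: rebuild readable names from the original tuples,
# keeping the index name (cars) as the first column
flattened.columns = [grouped.index.name] + ["_".join(pair) for pair in grouped.columns]
print(list(flattened.columns))
```

The cleanup relies on to_records() keeping the index first and the data columns in their original order.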
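Method 5: Flattening hierarchical columns using join() and rstrip()
In this example, we use the join() and rstrip() functions to flatten the columns. When grouping produces hierarchical column labels, the multi-level columns are stored as an array of tuples. Here we iterate through these tuples, join the column name and the aggregation name of each tuple with an underscore, strip any trailing underscore, and collect the flattened names in a list. This list of flattened column names is then assigned to the grouped dataframe.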
Syntax: str.join(iterable)
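Explanation: Returns the strings in the iterable concatenated with the separator string between them; raises a TypeError if the iterable contains a non-string value.
Syntax: str.rstrip([chars])
Explanation: Returns a copy of the string with the trailing (rightmost) characters removed; strips whitespace if chars is omitted, otherwise strips any trailing characters contained in chars.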
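The two string methods can be tried on plain tuples before applying them to a dataframe. In this sketch the tuples are written by hand to mimic column labels; the ("cars", "") tuple stands in for a label whose second level is empty.

```python
# A two-level label joined into one string
joined = "_".join(("sale_q1 in Cr", "sum"))
print(joined)

# A label with an empty second level leaves a trailing underscore,
# which rstrip("_") removes
with_trailing = "_".join(("cars", ""))
cleaned = with_trailing.rstrip("_")
print(repr(with_trailing), repr(cleaned))

# join() only accepts strings, so a non-string level raises TypeError
try:
    "_".join(("sale_q1 in Cr", 1))
    raised = False
except TypeError:
    raised = True
print(raised)
```

Note that rstrip("_") removes every trailing underscore, so a column name that legitimately ends in an underscore would be shortened as well.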
Code:
Python3
# import the python pandas package
import pandas as pd

# create a sample dataframe
data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]},
                    columns=["cars", "sale_q1 in Cr",
                             "sale_q2 in Cr"])

# group by cars and compute the sum
# and max of quarter 1 sales and the
# sum and min of quarter 2 sales
# (aggregations named as strings)
grouped_data = data.groupby(by="cars").agg({"sale_q1 in Cr": ["sum", "max"],
                                            "sale_q2 in Cr": ["sum", "min"]})

# use the join() and rstrip() functions to
# flatten the hierarchical columns
grouped_data.columns = ['_'.join(i).rstrip('_')
                        for i in grouped_data.columns.values]
print(grouped_data)
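The rstrip() call matters most when the row index is flattened as well. In the sketch below, which combines this method with reset_index() from Method 1, the cars column gets an empty second level, and rstrip('_') removes the underscore that join() would otherwise leave behind.

```python
import pandas as pd

data = pd.DataFrame({"cars": ["bmw", "bmw", "benz", "benz"],
                     "sale_q1 in Cr": [20, 22, 24, 26],
                     "sale_q2 in Cr": [11, 13, 15, 17]})

grouped = data.groupby("cars").agg({"sale_q1 in Cr": ["sum", "max"],
                                    "sale_q2 in Cr": ["sum", "min"]})

# Move cars from the row index into the columns; its label
# becomes the tuple ("cars", "") in the two-level column index
flat = grouped.reset_index()
print(flat.columns.tolist())

# Join both levels and strip the trailing underscore left by
# the empty second level
flat.columns = ["_".join(pair).rstrip("_") for pair in flat.columns.values]
print(list(flat.columns))
```

The result is a dataframe with a default integer index and a single level of column names.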