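Pandas Cut - Continuous to Categorical
Data analysis often involves continuous numerical data, which is frequently highly skewed. Some analyses become much easier once continuous data is converted to discrete categories. There are several ways to do the conversion, and one of them is the built-in Pandas cut function. pandas.cut() bins continuous numerical values into categorical data. It has three main parts: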
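- The first is the input, which must be one-dimensional: an array or a single DataFrame column.
- The second is the bins: the edges that split the continuous data into separate intervals. Within each pair of adjacent edges, the first number is where the bin starts and the next number is where it ends. The cut function lets you specify these edges explicitly.
- The last is the labels. The number of labels must always be one less than the number of bin edges.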
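The relationship between bin edges and labels can be illustrated with a minimal sketch. The `ages` values and the `edges` list are made up for illustration; the second call deliberately passes too few labels to show that pandas rejects the mismatch with a `ValueError`.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
ages = pd.Series([2, 12, 40])

# Four edges define three bins: (0, 3], (3, 17], (17, 63]
edges = [0, 3, 17, 63]

# Three bins need exactly three labels
ok = pd.cut(ages, bins=edges, labels=['Baby/Toddler', 'Child', 'Adult'])
print(ok.tolist())

# Passing only two labels for three bins raises a ValueError
try:
    pd.cut(ages, bins=edges, labels=['Baby/Toddler', 'Child'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(mismatch_error)
```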
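Note: Any NA value in the input is stored as NA in the result. Values that fall outside the bin edges also become NA in the resulting categorical bins.
The cut function makes no guarantee about how values are distributed across the bins. In fact, it is possible to define the bins in such a way that some of them contain no values at all.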
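Both points in the note can be seen in a small sketch. The `values` Series below is hypothetical: it contains one missing value, one value above the last edge, and nothing that falls in the middle bin.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value and one value above the last edge
values = pd.Series([5, 25, np.nan, 120])

# Three bins: (0, 10], (10, 20], (20, 30]
binned = pd.cut(values, bins=[0, 10, 20, 30], labels=['low', 'mid', 'high'])

# The NaN stays missing, and 120 lies outside every bin, so it is missing too
print(binned)

# value_counts() on categorical data lists every category,
# so the empty 'mid' bin shows up with a count of 0
counts = binned.value_counts()
print(counts)
```

Checking `value_counts()` after binning is a quick way to spot empty bins or to notice that some values were silently turned into NA.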
Syntax: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
Parameters:
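- x: The input array. It must be one-dimensional.
- bins: The bin boundaries used for segmentation. This can be a sequence of edges, or an integer giving the number of equal-width bins.
- right: Boolean indicating whether each bin includes its right edge. The default is True, so bins=[0, 3, 17] produces the intervals (0, 3] and (3, 17].
- labels: The labels for the returned bins. Either an array of labels or False; with False, only the integer position of each bin is returned.
Return Value: A Categorical, a Series, or a NumPy array, depending on the input and on labels. When retbins=True, the bin edges are returned as well, as a NumPy array or an IntervalIndex.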
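To make the effect of these parameters concrete, here is a brief sketch on a made-up `scores` Series whose values sit exactly on the bin edges. It also uses `include_lowest` and `retbins` from the syntax line above; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical values that sit exactly on the bin edges
scores = pd.Series([0, 10, 20])

# Default right=True: bins are (0, 10] and (10, 20], so 0 is in no bin
default_bins = pd.cut(scores, bins=[0, 10, 20])

# right=False: bins are [0, 10) and [10, 20), so 20 is in no bin
left_closed = pd.cut(scores, bins=[0, 10, 20], right=False)

# include_lowest=True also closes the first bin on the left, so 0 is kept
with_lowest = pd.cut(scores, bins=[0, 10, 20], include_lowest=True)

# labels=False returns the integer position of each bin instead of an interval
codes = pd.cut(scores, bins=[0, 10, 20], labels=False, include_lowest=True)

# An integer for bins asks for equal-width bins; retbins=True returns the edges too
result, bin_edges = pd.cut(scores, bins=2, retbins=True)

print(default_bins)
print(left_closed)
print(with_lowest)
print(codes)
print(bin_edges)
```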
Example 1: Suppose we have an array 'Age' of 14 random numbers between 1 and 100, and we want to split the data into 4 categories -
'Baby/Toddler' :- 0 to 3 years
'Child' :- 4 to 17 years
'Adult' :- 18 to 63 years
'Elderly' :- 64 to 99 years
Python3
# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
# Creating a dummy DataFrame of 14 random ages
# ranging from 1 to 100
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
10, 15, 38, 22, 77]})
# Printing DataFrame Before sorting Continuous
# to Categories
print("Before: ")
print(df)
# A column of name 'Label' is created in DataFrame
# Categorizing Age into 4 Categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
labels=['Baby/Toddler', 'Child', 'Adult',
'Elderly'])
# Printing DataFrame after sorting Continuous to
# Categories
print("After: ")
print(df)
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
Output:
Before:
Age
0 42
1 15
2 67
3 55
4 1
5 29
6 75
7 89
8 4
9 10
10 15
11 38
12 22
13 77
After:
Age Label
0 42 Adult
1 15 Child
2 67 Elderly
3 55 Adult
4 1 Baby/Toddler
5 29 Adult
6 75 Elderly
7 89 Elderly
8 4 Child
9 10 Child
10 15 Child
11 38 Adult
12 22 Adult
13 77 Elderly
Categories:
Adult 5
Elderly 4
Child 4
Baby/Toddler 1
Name: Label, dtype: int64
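Example 2: Suppose we have an array 'Height' with the heights of 12 random people, ranging from 150 cm to 180 cm, and we want to split the data into 3 categories.
'Short' :- greater than 150cm up to 157cm
'Average' :- greater than 157cm up to 169cm
'Tall' :- greater than 169cm up to 180cm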
Python3
# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
# Creating a dummy DataFrame of 12 numbers randomly
# ranging from 150-180 for height
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
159.2, 175, 162.4, 176, 153, 170.9]})
# Printing DataFrame Before Sorting Continuous to Categories
print("Before: ")
print(df)
# A column of name 'Label' is created in DataFrame
# Categorizing Height into 3 Categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
bins=[150, 157, 169, 180],
labels=['Short', 'Average', 'Tall'])
# Printing DataFrame After Sorting Continuous to Categories
print("After: ")
print(df)
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())
Output:
Before:
Height
0 150.4
1 157.6
2 170.0
3 176.0
4 164.2
5 155.0
6 159.2
7 175.0
8 162.4
9 176.0
10 153.0
11 170.9
After:
Height Label
0 150.4 Short
1 157.6 Average
2 170.0 Tall
3 176.0 Tall
4 164.2 Average
5 155.0 Short
6 159.2 Average
7 175.0 Tall
8 162.4 Average
9 176.0 Tall
10 153.0 Short
11 170.9 Tall
Categories:
Tall 5
Average 4
Short 3
Name: Label, dtype: int64