Pandas Cut - 连续到分类

在数据分析中经常会看到连续的、高度倾斜的数据等数值数据。有时分析在从连续数据转换为离散数据时变得毫不费力。可以通过多种方式进行转换，其中一种方法是使用 Pandas 的集成剪切功能。 Pandas 的 cut函数是将数值连续数据转换为分类数据的一种杰出方法。它有3个主要的必要部分：

首先是输入所需的一维数组/数据帧。
另一个主要部分是垃圾箱。表示连续数据的单独 bin 边界的 bin。第一个数字表示 bin 的起点，后面的数字表示 bin 的终点。切割函数允许更明确的箱
最后的主要部分是标签。标签的数量无一例外都会比箱的数量少一。

注意：对于任何 NA 值，结果将存储为 NA。在结果分类箱中，越界值也将是 NA。

在使用 pandas cut函数时，它无法保证每个 bin 中值的分布。事实上，我们最终可能会以这样一种方式定义 bin，即 bin 可能不包含任何值。

Syntax: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates=’raise’, ordered=True)

Parameters:

x: Input array. Need to be 1-dimensional.
bins: Denotes the bin boundaries for segmentation
right: Denotes whether rightmost edge of bins should be included or not. Boolean type of value. Default value is True.
labels: Defines labels for returned segmented bins. Array or boolean

Return Value: Returns a Categorical series/numpy array/IntervalIndex

编程需要懂一点英语

示例 1：假设我们有一个包含 15 个随机数（从 1 到 100）的数组“年龄”，我们希望将数据分成 4 个类别 -

'Baby/Toddler' :- 0 to 3 years
'Child' :- 4 to 17 years
'Adult' :- 18 to 63 years
'Elderly' :- 64 to 99 years

Python3

# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
  
# Creating a dummy DataFrame of 15 numbers randomly
# ranging from 1-100 for age
df = pd.DataFrame({'Age': [42, 15, 67, 55, 1, 29, 75, 89, 4,
                           10, 15, 38, 22, 77]})
  
# Printing DataFrame Before sorting Continuous 
# to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Age into 4 Categories
# Baby/Toddler: (0,3], 0 is excluded & 3 is included
# Child: (3,17], 3 is excluded & 17 is included
# Adult: (17,63], 17 is excluded & 63 is included
# Elderly: (63,99], 63 is excluded & 99 is included
df['Label'] = pd.cut(x=df['Age'], bins=[0, 3, 17, 63, 99],
                     labels=['Baby/Toddler', 'Child', 'Adult',
                             'Elderly'])
  
# Printing DataFrame after sorting Continuous to
# Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())

Python3

# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
  
# Creating a dummy DataFrame of 12 numbers randomly
# ranging from 150-180 for height
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
                              159.2, 175, 162.4, 176, 153, 170.9]})
  
# Printing DataFrame Before Sorting Continuous to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Height into 3 Categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
                     bins=[150, 157, 169, 180],
                     labels=['Short', 'Average', 'Tall'])
  
# Printing DataFrame After Sorting Continuous to Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())

输出：

Before: 
    Age
0    42
1    15
2    67
3    55
4     1
5    29
6    75
7    89
8     4
9    10
10   15
11   38
12   22
13   77
After: 
    Age         Label
0    42         Adult
1    15         Child
2    67       Elderly
3    55         Adult
4     1  Baby/Toddler
5    29         Adult
6    75       Elderly
7    89       Elderly
8     4         Child
9    10         Child
10   15         Child
11   38         Adult
12   22         Adult
13   77       Elderly
Categories: 
Adult           5
Elderly         4
Child           4
Baby/Toddler    1
Name: Label, dtype: int64

示例 #2：假设我们有一个包含 12 个随机人的数组“身高”，从 150 厘米到 180 厘米，我们希望将数据分成 3 个类别。

'Short' :- greater than 150cm upto 157cm
'Average' :- greater than 157cm upto 170cm
'Tall' :- greater than 170cm upto 180cm

Python3

# Importing pandas and numpy libraries
import pandas as pd
import numpy as np
  
# Creating a dummy DataFrame of 12 numbers randomly
# ranging from 150-180 for height
df = pd.DataFrame({'Height': [150.4, 157.6, 170, 176, 164.2, 155,
                              159.2, 175, 162.4, 176, 153, 170.9]})
  
# Printing DataFrame Before Sorting Continuous to Categories
print("Before: ")
print(df)
  
# A column of name 'Label' is created in DataFrame
# Categorizing Height into 3 Categories
# Short: (150,157], 150 is excluded & 157 is included
# Average: (157,169], 157 is excluded & 169 is included
# Tall: (169,180], 169 is excluded & 180 is included
df['Label'] = pd.cut(x=df['Height'],
                     bins=[150, 157, 169, 180],
                     labels=['Short', 'Average', 'Tall'])
  
# Printing DataFrame After Sorting Continuous to Categories
print("After: ")
print(df)
  
# Check the number of values in each bin
print("Categories: ")
print(df['Label'].value_counts())

输出：

Before: 
    Height
0    150.4
1    157.6
2    170.0
3    176.0
4    164.2
5    155.0
6    159.2
7    175.0
8    162.4
9    176.0
10   153.0
11   170.9
After: 
    Height    Label
0    150.4    Short
1    157.6  Average
2    170.0     Tall
3    176.0     Tall
4    164.2  Average
5    155.0    Short
6    159.2  Average
7    175.0     Tall
8    162.4  Average
9    176.0     Tall
10   153.0    Short
11   170.9     Tall
Categories: 
Tall       5
Average    4
Short      3
Name: Label, dtype: int64