Python | Binning method for data smoothing

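Prerequisite: ML | Binning or Discretization

The binning method is used to smooth data or to handle noisy data. In this method, the data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of each value, they perform local smoothing. Smoothing can be performed in three ways: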

Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
Smoothing by bin median: each value in a bin is replaced by the median value of the bin.
Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

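Approach:

Sort the array of the given data set.
Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
For each bin, store the smoothed values (mean, median, or nearest boundary) in one row.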
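
The steps above can be sketched as follows. This is a minimal illustration rather than the article's implementation: the function names `equal_depth_bins` and `smooth_by_mean` are hypothetical, the input values are made up, and it assumes `np.array_split` is an acceptable way to form bins of nearly equal size when the sample count is not a multiple of N.

```python
import numpy as np

def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = np.sort(np.asarray(values, dtype=float))
    # array_split tolerates lengths that are not a multiple of n_bins:
    # the first few bins simply receive one extra element.
    return np.array_split(ordered, n_bins)

def smooth_by_mean(bins):
    """Replace every value in each bin with that bin's mean (one row per bin)."""
    return [np.full(len(bin_values), bin_values.mean()) for bin_values in bins]

# Made-up sample: 10 values into 3 bins gives bin sizes 4, 3, 3.
bins = equal_depth_bins([7, 1, 5, 3, 9, 11, 2, 8, 4, 10], 3)
for original, smoothed in zip(bins, smooth_by_mean(bins)):
    print(original, "->", smoothed)
```

Swapping `bin_values.mean()` for `np.median(bin_values)` would give smoothing by bin median with the same structure.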

Example:

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition using equal frequency approach:
      - Bin 1 : 4, 8, 9, 15
      - Bin 2 : 21, 21, 24, 25
      - Bin 3 : 26, 28, 29, 34

Smoothing by bin means (means 9, 22.75 and 29.25, rounded to the nearest integer):
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34
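
The boundary listing above can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: `smooth_by_boundaries` is a hypothetical helper name, each bin is assumed to be sorted already, and values exactly halfway between the two boundaries are sent to the upper boundary (the text does not specify a tie rule, and the example contains no ties).

```python
import numpy as np

def smooth_by_boundaries(bin_values):
    """Replace each value in a sorted bin with the nearer of the bin's min and max."""
    bin_values = np.asarray(bin_values, dtype=float)
    low, high = bin_values[0], bin_values[-1]
    # Strictly closer to the lower boundary -> low; otherwise (including ties) -> high.
    return np.where(bin_values - low < high - bin_values, low, high)

price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
for values in price_bins:
    print(smooth_by_boundaries(values))
```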

Smoothing by bin median (medians 8.5, 22.5 and 28.5, rounded up):
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
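
The mean and median listings show whole numbers, while the exact bin statistics are fractional for some bins. The short check below computes the exact values for the three price bins. The `round_half_up` helper is hypothetical, and rounding halves upward is an assumption made here because it matches the listed values.

```python
import math
import numpy as np

price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

def round_half_up(x):
    # Python's built-in round() rounds halves to the nearest even integer
    # (round(22.5) == 22), so halves are rounded up explicitly instead.
    return math.floor(x + 0.5)

for values in price_bins:
    mean = float(np.mean(values))
    median = float(np.median(values))
    print(f"mean={mean} median={median} "
          f"rounded: {round_half_up(mean)}, {round_half_up(median)}")
```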

Below is a Python implementation of the above algorithm:

Python3

import numpy as np
from sklearn.datasets import load_iris
 
# load the iris data set
dataset = load_iris()
a = dataset.data

# take the column at index 1 (the second of the 4 feature columns)
# and sort it, since binning operates on sorted values
b = np.sort(a[:, 1])
 
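# create one result array per smoothing method:
# 150 sorted values -> 30 bins (rows) of 5 values each
bin1=np.zeros((30,5))  # smoothed by bin mean
bin2=np.zeros((30,5))  # smoothed by bin boundaries
bin3=np.zeros((30,5))  # smoothed by bin median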
 
# Bin mean: replace every value in a bin with the mean of that bin
for i in range(0, 150, 5):
    k = i // 5  # bin (row) index
    mean = np.mean(b[i:i+5])
    for j in range(5):
        bin1[k, j] = mean
print("Bin Mean: \n", bin1)
     
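# Bin boundaries: replace each value with the nearer of the bin's
# minimum (b[i]) and maximum (b[i+4]); ties go to the maximum
for i in range(0, 150, 5):
    k = i // 5  # bin (row) index
    for j in range(5):
        if (b[i+j] - b[i]) < (b[i+4] - b[i+j]):
            bin2[k, j] = b[i]
        else:
            bin2[k, j] = b[i+4]
print("Bin Boundaries: \n", bin2)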
 
# Bin median: each bin holds 5 sorted values, so the median is the middle one
for i in range(0, 150, 5):
    k = i // 5  # bin (row) index
    for j in range(5):
        bin3[k, j] = b[i+2]
print("Bin Median: \n", bin3)
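
Because every bin holds the same number of values, the three results can also be computed without explicit loops by reshaping the sorted array into one row per bin. The sketch below illustrates that idea on a small made-up array rather than the iris column. `smooth_fixed_bins` is a hypothetical helper name, and the code assumes the input is already sorted and its length is a multiple of the bin size.

```python
import numpy as np

def smooth_fixed_bins(sorted_values, bin_size):
    """Return (mean, boundary, median) smoothed arrays, one row per bin.

    Assumes sorted_values is already sorted and that its length is a
    multiple of bin_size.
    """
    rows = np.asarray(sorted_values, dtype=float).reshape(-1, bin_size)
    ones = np.ones_like(rows)

    # keepdims=True gives one column per bin, which is then spread across the row.
    means = rows.mean(axis=1, keepdims=True) * ones
    medians = np.median(rows, axis=1, keepdims=True) * ones

    low, high = rows[:, :1], rows[:, -1:]
    # Same comparison as the loop version: ties go to the upper boundary.
    boundaries = np.where(rows - low < high - rows, low, high)
    return means, boundaries, medians

# Made-up sample: 10 sorted values -> 2 bins of 5.
data = np.sort(np.array([2.0, 2.2, 2.3, 2.5, 3.0, 3.0, 3.1, 3.4, 3.8, 4.4]))
means, boundaries, medians = smooth_fixed_bins(data, 5)
print("Bin Mean: \n", means)
print("Bin Boundaries: \n", boundaries)
print("Bin Median: \n", medians)
```

For the iris column in the listing above, the equivalent call would be `smooth_fixed_bins(b, 5)`, which yields three arrays of shape (30, 5).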