Python|数据平滑的分箱方法
先决条件:机器学习 |分箱或离散分箱方法用于平滑数据或处理噪声数据。在这种方法中,首先对数据进行排序,然后将排序后的值分配到多个桶或箱中。当分箱方法参考值的邻域时,它们执行局部平滑。执行平滑的三种方法 -
Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin median : In this method each bin value is replaced by its bin median value. Smoothing by bin boundary : In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
方法:
- 对给定数据集的数组进行排序。
- 将范围划分为 N 个区间,每个区间包含大致相同数量的样本(等深分区)。
- 在每一行中存储平均值/中值/边界。
例子:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition using equal frequency approach:
- Bin 1 : 4, 8, 9, 15
- Bin 2 : 21, 21, 24, 25
- Bin 3 : 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing by bin median:
- Bin 1: 9 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
下面是上述算法的Python实现——
Python3
import numpy as np
import math
from sklearn.datasets import load_iris
from sklearn import datasets, linear_model, metrics
# load iris data set
dataset = load_iris()
a = dataset.data
b = np.zeros(150)
# take 1st column among 4 column of data set
for i in range (150):
b[i]=a[i,1]
b=np.sort(b) #sort the array
# create bins
bin1=np.zeros((30,5))
bin2=np.zeros((30,5))
bin3=np.zeros((30,5))
# Bin mean
for i in range (0,150,5):
k=int(i/5)
mean=(b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4])/5
for j in range(5):
bin1[k,j]=mean
print("Bin Mean: \n",bin1)
# Bin boundaries
for i in range (0,150,5):
k=int(i/5)
for j in range (5):
if (b[i+j]-b[i]) < (b[i+4]-b[i+j]):
bin2[k,j]=b[i]
else:
bin2[k,j]=b[i+4]
print("Bin Boundaries: \n",bin2)
# Bin median
for i in range (0,150,5):
k=int(i/5)
for j in range (5):
bin3[k,j]=b[i+2]
print("Bin Median: \n",bin3)