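Binning in Data Mining
Data binning, also called bucketing, is a data preprocessing method used to reduce the effect of minor observation errors. The original data values are divided into small intervals called bins, and each value is then replaced by a representative value computed for its bin, such as the bin mean or median. This smooths the input data and, on small datasets, can also reduce the chance of overfitting.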
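The replacement step described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `smooth_by_bin_means` is hypothetical, the bin mean is only one possible representative value, and the bins are hard-coded here on the assumption that the values have already been grouped.

```python
def smooth_by_bin_means(bins):
    """Replace every value in each bin with that bin's mean."""
    smoothed = []
    for b in bins:
        if not b:
            # An empty bin has no mean; keep it empty.
            smoothed.append([])
            continue
        mean = sum(b) / len(b)
        smoothed.append([mean] * len(b))
    return smoothed


# Bins assumed to be given already (hard-coded for illustration).
bins = [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(smooth_by_bin_means(bins))
# [[9.75, 9.75, 9.75, 9.75], [38.75, 38.75, 38.75, 38.75], [145.75, 145.75, 145.75, 145.75]]
```

Each bin keeps its size, but the small differences between the values inside a bin disappear, which is the smoothing effect.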
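There are two ways to divide data into bins:
- Equal-frequency binning: every bin holds the same number of values.
- Equal-width binning: every bin spans the same width. The bin boundaries are min, min + w, min + 2w, ..., min + nw, so the bins are [min, min + w], [min + w, min + 2w], ..., [min + (n - 1)w, min + nw], where w = (max - min) / n and n is the number of bins.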
Equal frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Equal width:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
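The equal-width grouping above follows from the bin boundaries. A minimal sketch of that arithmetic is shown here; the helper name `equal_width_edges` is hypothetical, and it assumes a non-empty list of numbers.

```python
def equal_width_edges(values, n_bins):
    """Return the n_bins + 1 boundaries of equal-width bins over values."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins  # bin width
    return [lo + i * w for i in range(n_bins + 1)]


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_width_edges(data, 3))
# [5.0, 75.0, 145.0, 215.0]
```

Here w = (215 - 5) / 3 = 70, so nine values fall between 5 and 75, only 92 falls between 75 and 145, and 204 and 215 fall between 145 and 215.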
Code: a Python implementation of the binning techniques:
# equal frequency: split the sorted data into m bins of (nearly) equal size
def equifreq(arr1, m):
    values = sorted(arr1)
    a = len(values)
    n = a // m  # base number of values per bin
    for i in range(m):
        start = i * n
        # the last bin also takes any leftover values when a is not divisible by m
        end = (i + 1) * n if i < m - 1 else a
        print(values[start:end])


# equal width: split the value range into m intervals of width w
def equiwidth(arr1, m):
    min1 = min(arr1)
    # true division keeps the last boundary exactly at max(arr1)
    w = (max(arr1) - min1) / m
    arri = [[] for _ in range(m)]
    for j in arr1:
        # bins are half-open [edge, edge + w); the maximum falls in the last bin
        i = min(int((j - min1) / w), m - 1) if w else 0
        arri[i].append(j)
    print(arri)


# data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# number of bins
m = 3

print("equal frequency binning")
equifreq(data, m)

print("\n\nequal width binning")
equiwidth(data, m)
Output:
equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
equal width binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]