ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python

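In machine learning and data science, we often come across a term called imbalanced data distribution, which generally happens when the number of observations in one of the classes is much higher or lower than in the other classes. Because machine learning algorithms tend to increase accuracy by reducing the overall error, they do not take the class distribution into account. This problem is prevalent in applications such as fraud detection, anomaly detection, and facial recognition.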

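Standard ML techniques such as decision trees and logistic regression are biased towards the majority class and tend to ignore the minority class. They tend to predict only the majority class, so the minority class is misclassified far more often than the majority class. In more technical terms, if the data distribution in our dataset is imbalanced, our model becomes more prone to negligible or very low recall on the minority class.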

Imbalanced data handling techniques: two algorithms are widely used for handling imbalanced class distributions.

SMOTE
Near Miss algorithm

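SMOTE (Synthetic Minority Oversampling Technique) – Oversampling

SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods for solving the imbalance problem.
It aims to balance the class distribution by increasing the number of minority class examples. Rather than replicating existing examples, it creates new ones.
SMOTE synthesises new minority instances between existing minority instances. It generates virtual training records by linear interpolation within the minority class. These synthetic training records are generated by randomly selecting one or more of the k nearest neighbours of each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied to the processed data.
A closer look at how the SMOTE algorithm works: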

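Step 1: Let A be the minority class set. For each $x \in A$, the k nearest neighbours of x are obtained by calculating the Euclidean distance between x and every other sample in A.

Step 2: The sampling rate N is set according to the imbalance proportion. For each $x \in A$, N examples (i.e. $x_1, x_2, \dots, x_N$) are randomly selected from its k nearest neighbours, and they form the set $A_1$.

Step 3: For each example $x_k \in A_1$ (k = 1, 2, 3, …, N), the following formula is used to generate a new example:
$x' = x + rand(0, 1) * (x_k - x)$
where rand(0, 1) represents a random number between 0 and 1, so $x'$ lies on the line segment between $x$ and its neighbour $x_k$.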
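
The three steps can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not the implementation used by imblearn: the function name `smote_sketch` and its parameters `n_per_sample`, `k` and `seed` are hypothetical, neighbours are found by brute force, and the minority set is assumed to hold at least two samples.

```python
import numpy as np

def smote_sketch(A, n_per_sample=1, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards near neighbours.

    Illustrative sketch only; all names here are hypothetical.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    k = min(k, len(A) - 1)  # cannot have more neighbours than other samples

    # Step 1: Euclidean distance between every pair of minority samples
    dists = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for i, x in enumerate(A):
        # Step 2: randomly pick N of the k nearest neighbours
        chosen = rng.choice(neighbours[i], size=n_per_sample, replace=True)
        for j in chosen:
            # Step 3: x' = x + rand(0, 1) * (x_k - x)
            gap = rng.random()
            synthetic.append(x + gap * (A[j] - x))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [2.5, 2.5]])
new_points = smote_sketch(minority, n_per_sample=2, k=2)
print(new_points.shape)  # (8, 2): two synthetic points per original sample
```

Because every synthetic point is an interpolation between two real minority samples, none of them can fall outside the range already covered by the minority class.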

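NearMiss Algorithm – Undersampling

NearMiss is an undersampling technique. It aims to balance the class distribution by eliminating majority class examples, chosen by their distance to minority class examples rather than purely at random. When instances of two different classes are very close to each other, we remove the instances of the majority class to increase the space between the two classes. This helps the classification process.
To prevent the information loss that affects most undersampling techniques, near-neighbour methods are widely used.
The basic intuition behind how near-neighbour methods work is as follows: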

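Step 1: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the majority class is the one to be undersampled.

Step 2: Then, the n instances of the majority class that have the smallest distances to those in the minority class are selected.

Step 3: If there are k instances in the minority class, the nearest method will result in k*n instances of the majority class.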

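To find the n closest instances in the majority class, there are several variations of the NearMiss algorithm:

NearMiss – Version 1: It selects the samples of the majority class for which the average distance to the k closest instances of the minority class is smallest.
NearMiss – Version 2: It selects the samples of the majority class for which the average distance to the k farthest instances of the minority class is smallest.
NearMiss – Version 3: It works in two steps. First, for each minority class instance, its M nearest neighbours are stored. Then, the majority class instances are selected for which the average distance to the N nearest neighbours is the largest.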
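
As a rough illustration of Version 1, the selection rule can be written with plain NumPy. This is a sketch under stated assumptions rather than imblearn's implementation: the function name `near_miss_1_sketch` and the parameters `n_keep` and `k` are made up for this example, and distances are computed by brute force.

```python
import numpy as np

def near_miss_1_sketch(X_major, X_minor, n_keep, k=3):
    """Return indices of the n_keep majority samples whose mean distance
    to their k closest minority samples is smallest (the Version 1 rule).

    Illustrative sketch only; all names here are hypothetical.
    """
    X_major = np.asarray(X_major, dtype=float)
    X_minor = np.asarray(X_minor, dtype=float)
    k = min(k, len(X_minor))

    # distance from every majority sample to every minority sample
    dists = np.linalg.norm(X_major[:, None, :] - X_minor[None, :, :], axis=2)

    # mean distance to the k closest minority samples, per majority sample
    mean_dist = np.sort(dists, axis=1)[:, :k].mean(axis=1)

    # keep the majority samples with the smallest mean distance
    return np.argsort(mean_dist)[:n_keep]

X_minor = np.array([[0.0, 0.0], [1.0, 0.0]])
X_major = np.array([[0.5, 0.5], [5.0, 5.0], [0.0, 1.0], [9.0, 9.0]])
kept = near_miss_1_sketch(X_major, X_minor, n_keep=2, k=2)
print(sorted(kept.tolist()))  # [0, 2]: the two majority points nearest the minority class
```

Version 2 would differ only in averaging over the k farthest minority samples instead of the k closest ones.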

This article helps in better understanding and hands-on practice of how to choose between different imbalanced data handling techniques.

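Load libraries and data file

The dataset consists of transactions made by credit cards. It contains 492 fraudulent transactions out of 284,807 transactions. This makes it highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
The dataset can be downloaded from here.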

# import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

# load the data set
data = pd.read_csv('creditcard.csv')

# print info about columns in the dataframe
# (info() prints its report directly and returns None, so it is not wrapped in print())
data.info()

Output:

RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64

# normalise the amount column
data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))
  
# drop Time and Amount columns as they are not relevant for prediction purpose 
data = data.drop(['Time', 'Amount'], axis = 1)
  
# as you can see there are 492 fraud transactions.
data['Class'].value_counts()

Output:

0    284315
1       492

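Split the data into test and train sets

from sklearn.model_selection import train_test_split

# separate the features from the target; the listing never defines X and y,
# so this is an assumed reconstruction (y is kept as a 2-D column so that
# the shapes printed below match)
X = data.drop('Class', axis = 1).values
y = data[['Class']].values

# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)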
  
# describes info about train and test set
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Output:

Number transactions X_train dataset:  (199364, 29)
Number transactions y_train dataset:  (199364, 1)
Number transactions X_test dataset:  (85443, 29)
Number transactions y_test dataset:  (85443, 1)

Now train the model without handling the imbalanced class distribution

# logistic regression object
lr = LogisticRegression()
  
# train the model on train set
lr.fit(X_train, y_train.ravel())
  
predictions = lr.predict(X_test)
  
# print classification report
print(classification_report(y_test, predictions))

Output:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85296
           1       0.88      0.62      0.73       147

    accuracy                           1.00     85443
   macro avg       0.94      0.81      0.86     85443
weighted avg       1.00      1.00      1.00     85443

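The accuracy comes out as 100% (after rounding), but did you notice something strange?
The recall of the minority class is low, at only 62%. This shows that the model is biased towards the majority class, so it is not the best model.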
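
To see why accuracy alone says so little here, consider a hypothetical "model" that labels every test transaction as legitimate. The short calculation below uses only the test-set class counts from the report above (85,296 legitimate and 147 fraudulent transactions); the variable names are made up for illustration.

```python
# test-set class counts taken from the classification report above
n_legit, n_fraud = 85296, 147

# a hypothetical "model" that always predicts class 0 (legitimate)
true_negatives = n_legit    # every legitimate transaction is labelled correctly
false_negatives = n_fraud   # every fraud is missed
true_positives = 0

accuracy = true_negatives / (n_legit + n_fraud)
fraud_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy: {accuracy:.4f}")          # accuracy: 0.9983
print(f"fraud recall: {fraud_recall:.2f}")  # fraud recall: 0.00
```

Even this useless classifier scores above 99.8% accuracy, which is why the recall of the minority class is the number to watch.
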
Now, we will apply different imbalanced data handling techniques and look at their accuracy and recall results.

Using the SMOTE Algorithm

You can check all the parameters from here.

print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
  
# import SMOTE module from imblearn library
# pip install imbalanced-learn (if you don't have imblearn in your system)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 2)
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
  
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0)))

Output:

Before OverSampling, counts of label '1': [345]
Before OverSampling, counts of label '0': [199019] 

After OverSampling, the shape of train_X: (398038, 29)
After OverSampling, the shape of train_y: (398038, ) 

After OverSampling, counts of label '1': 199019
After OverSampling, counts of label '0': 199019

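Look! The SMOTE algorithm has oversampled the minority instances and made their number equal to that of the majority class. Both categories now have an equal number of records. More specifically, the minority class has been increased to the total number of the majority class.
Now see the accuracy and recall results after applying the SMOTE algorithm (oversampling).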

Prediction and Recall

lr1 = LogisticRegression()
lr1.fit(X_train_res, y_train_res.ravel())
predictions = lr1.predict(X_test)
  
# print classification report
print(classification_report(y_test, predictions))

Output:

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     85296
           1       0.06      0.92      0.11       147

    accuracy                           0.98     85443
   macro avg       0.53      0.95      0.55     85443
weighted avg       1.00      0.98      0.99     85443

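Compared with the previous model, the accuracy has dropped to 98%, but the recall of the minority class has improved to 92%. The trade-off is that minority class precision fell from 0.88 to 0.06, so many more legitimate transactions are now flagged as fraud. If catching fraud matters most, this is a better model than the previous one.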
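
The f1-score column in the two reports combines precision and recall as their harmonic mean, which makes this trade-off visible in a single number. The sketch below recomputes it from the rounded minority class values printed in the reports; the helper name `f1_from` is hypothetical, and because the inputs are rounded the results are approximate.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# rounded minority class (label 1) values copied from the two reports
baseline_f1 = f1_from(0.88, 0.62)  # model trained on the original data
smote_f1 = f1_from(0.06, 0.92)     # model trained on SMOTE-resampled data

print(round(baseline_f1, 2), round(smote_f1, 2))  # 0.73 0.11
```

Both values agree with the f1-score column, and they show that the gain in recall comes at a heavy cost in precision.
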
Now, we will apply the NearMiss technique to undersample the majority class and look at its accuracy and recall results.

NearMiss Algorithm:

You can check all the parameters from here.

print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0)))
  
# apply near miss
from imblearn.under_sampling import NearMiss
nr = NearMiss()
  
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.ravel())
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape))
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1)))
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0)))

Output:

Before Undersampling, counts of label '1': [345]
Before Undersampling, counts of label '0': [199019] 

After Undersampling, the shape of train_X: (690, 29)
After Undersampling, the shape of train_y: (690, ) 

After Undersampling, counts of label '1': 345
After Undersampling, counts of label '0': 345

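The NearMiss algorithm has undersampled the majority instances and made their number equal to that of the minority class. Here, the majority class has been reduced to the total number of the minority class, so that both classes have an equal number of records.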

Prediction and Recall

# train the model on train set
lr2 = LogisticRegression()
lr2.fit(X_train_miss, y_train_miss.ravel())
predictions = lr2.predict(X_test)
  
# print classification report
print(classification_report(y_test, predictions))

Output:

              precision    recall  f1-score   support

           0       1.00      0.56      0.72     85296
           1       0.00      0.95      0.01       147

    accuracy                           0.56     85443
   macro avg       0.50      0.75      0.36     85443
weighted avg       1.00      0.56      0.72     85443

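This model achieves higher minority class recall than the first model, at 95%. However, because the majority class was undersampled so heavily, its recall has dropped to 56%, and minority class precision rounds to 0.00. So in this case, SMOTE gives me a better balance of accuracy and recall, and I'll go ahead and use that model! 🙂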