Implementing the AdaBoost Algorithm From Scratch

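AdaBoost models belong to a class of ensemble machine learning models. The literal meaning of the word "ensemble" already gives a good intuition for how such a model works: an ensemble model combines several different models to produce a stronger, more accurate meta-model. This meta-model has comparatively higher prediction accuracy than its individual component models. The working of these ensemble models is covered in the article Ensemble Classifier | Data Mining.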

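The AdaBoost algorithm falls under ensemble boosting techniques. As discussed, it combines multiple models to produce more accurate results, and it does so in two phases: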

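Multiple weak learners are allowed to learn on the training data.
These models are combined into a meta-model that aims to correct the errors made by the individual weak learners.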
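The two phases above can be sketched without any library support. The following is a minimal illustration of discrete AdaBoost, assuming a binary problem with labels in {-1, +1} and decision stumps (single-threshold rules) as the weak learners. The function names `fit_stump`, `stump_predict`, `adaboost_fit` and `adaboost_predict` are hypothetical helpers written for this sketch, and the toy data is invented; it is not how scikit-learn's `AdaBoostClassifier` is implemented.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    # Predict `sign` where feature j is <= thr, and -sign elsewhere
    return np.where(X[:, j] <= thr, sign, -sign)

def fit_stump(X, y, w):
    # Exhaustively search for the stump with the lowest weighted error
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = stump_predict(X, j, thr, sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost_fit(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)          # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        # Phase 1: train a weak learner on the weighted data
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote weight
        pred = stump_predict(X, j, thr, sign)
        # Increase the weight of misclassified samples, then renormalize
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def adaboost_predict(X, ensemble):
    # Phase 2: combine the weak learners by a weighted vote
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data that no single threshold can separate
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([1, 1, -1, -1, 1, 1])

single = adaboost_fit(X_toy, y_toy, n_rounds=1)
boosted = adaboost_fit(X_toy, y_toy, n_rounds=3)
single_acc = np.mean(adaboost_predict(X_toy, single) == y_toy)
train_acc = np.mean(adaboost_predict(X_toy, boosted) == y_toy)
print("one stump:", single_acc, "three rounds:", train_acc)
```

On this toy set a single stump cannot classify every point, while the weighted vote of three stumps does, which is the effect the second phase is meant to achieve.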

Note: For more information, refer to Boosting ensemble models.

In this article, we will walk through a practical implementation of the AdaBoost classifier on a dataset.

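In this problem, we are given a dataset containing 3 species of flowers together with their features (sepal length, sepal width, petal length and petal width), and we have to classify each flower into one of these species. The dataset is available for download as a CSV file; the code below expects it to be saved as Iris.csv.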

Let's begin by importing the libraries required for the classification task:

Python

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
import warnings
warnings.filterwarnings("ignore")

After importing the libraries, we load our dataset using the pandas read_csv method, as shown below:

Python

# Reading the dataset from the csv file
# no separator is passed, so read_csv uses its default (a comma)
data = pd.read_csv("Iris.csv")
  
# Printing the shape of the dataset
print(data.shape)

(150, 6)

We can see that our dataset contains 150 rows and 6 columns. Let's look at the actual contents of the dataset using the head() method:

Python

data.head()

	Id	SepalLengthCm	SepalWidthCm	PetalLengthCm	PetalWidthCm	Species
0	1	5.1	3.5	1.4	0.2	Iris-setosa
1	2	4.9	3.0	1.4	0.2	Iris-setosa
2	3	4.7	3.2	1.3	0.2	Iris-setosa
3	4	4.6	3.1	1.5	0.2	Iris-setosa
4	5	5.0	3.6	1.4	0.2	Iris-setosa

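The first column is the Id column, which carries no information about the flowers, so we drop it. The Species column is our target: it tells us which species each flower belongs to.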

Python

data = data.drop('Id',axis=1)
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
print("Shape of X is %s and shape of y is %s"%(X.shape,y.shape))

Shape of X is (150, 4) and shape of y is (150,)
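The positional slicing with iloc works because Species happens to be the last column. As a hedged alternative, the same split can be written by column name, which does not depend on column order. The small frame below is rebuilt by hand from the five rows shown by data.head() above, and the names `sample`, `X_named` and `y_named` are introduced only for this illustration.

```python
import pandas as pd

# Tiny frame rebuilt from the five rows printed by data.head()
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "Species": ["Iris-setosa"] * 5,
})

# Select features and target by name instead of by position
X_named = sample.drop(columns=["Id", "Species"])
y_named = sample["Species"]
print(X_named.shape, y_named.shape)
```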

Python

total_classes = y.nunique()
print("Number of unique species in dataset are: ",total_classes)

Number of unique species in dataset are: 3

Python

distribution = y.value_counts()
print(distribution)

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64

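Digging a little deeper into the dataset, the output above shows that it contains 3 classes and that the 150 samples are split evenly among them: each of the three species has 50 samples, so there is no class imbalance.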

Now we split the dataset into training and validation sets, with the validation set making up 25% of the total data.

Python

X_train,X_val,Y_train,Y_val = train_test_split(X,y,test_size=0.25,random_state=28)
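The split above is purely random, so the three species need not be equally represented in the validation set. One optional variation, not used in this article, is to pass `stratify` to `train_test_split` so that class proportions are preserved. The sketch below assumes scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for Iris.csv, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for Iris.csv (labels are 0, 1, 2)
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=28, stratify=y_demo
)

val_counts = np.bincount(y_va)   # validation samples per class
print(len(y_tr), len(y_va), val_counts)
```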

After creating the training and validation sets, we build our AdaBoost classifier model and fit it on the training set.

Python

# Creating adaboost classifier model
adb = AdaBoostClassifier()
adb_model = adb.fit(X_train,Y_train)
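A fitted AdaBoostClassifier keeps its weak learners, so it can be inspected to see what the ensemble is made of. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled iris data instead of Iris.csv and fixes `random_state=0` for repeatability. With default settings the weak learners are depth-1 decision trees, and at most 50 of them are trained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

# Assumption: the bundled iris data stands in for Iris.csv
X_demo, y_demo = load_iris(return_X_y=True)
clf = AdaBoostClassifier(random_state=0).fit(X_demo, y_demo)

n_learners = len(clf.estimators_)              # weak learners actually trained
first_depth = clf.estimators_[0].get_depth()   # depth of the first weak learner
importances = clf.feature_importances_         # one value per input feature

print("weak learners:", n_learners)
print("depth of first learner:", first_depth)
print("feature importances:", importances)
```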

Once the model has been fitted on the training set, we check its accuracy on the validation set.

Python

print("The accuracy of the model on validation set is", adb_model.score(X_val,Y_val))

The accuracy of the model on validation set is 0.9210526315789473

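As we can see, the model reaches an accuracy of about 92% on the validation set (35 of the 38 validation samples classified correctly), which is quite good considering that no hyperparameter tuning or feature engineering was done.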
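A single 38-sample validation set gives a fairly coarse estimate, since each misclassified flower moves the score by more than two percentage points. A hedged way to get a steadier number is k-fold cross-validation. The sketch below assumes scikit-learn's bundled iris data in place of Iris.csv and an arbitrary choice of 5 folds; the exact scores depend on the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Assumption: the bundled iris data stands in for Iris.csv
X_demo, y_demo = load_iris(return_X_y=True)

# Train and score on 5 different train/validation partitions
scores = cross_val_score(AdaBoostClassifier(random_state=0), X_demo, y_demo, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```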