Stacking in Machine Learning
Stacking:
Stacking is a way to ensemble classification or regression models. It consists of two layers of estimators: the first layer contains the baseline models, which predict the outputs on the test dataset; the second layer is a meta-classifier or meta-regressor that takes all of the baseline models' predictions as input and generates a new prediction.
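The two-layer idea can be sketched with scikit-learn's equivalent StackingClassifier on a synthetic dataset (the heart-disease data used later in this article is not needed for the concept; the data and estimator choices here are purely illustrative):

```python
# Minimal two-layer stacking sketch on synthetic, illustrative data.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# First layer: the baseline models.
base_models = [('knn', KNeighborsClassifier()), ('nb', GaussianNB())]

# Second layer: a meta-classifier trained on the baseline predictions.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
score = stack.fit(X_train, y_train).score(X_test, y_test)
print(score)
```

The article below builds the same architecture with mlxtend's StackingClassifier instead.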
Stacking Architecture:
mlxtend:
Mlxtend (machine learning extensions) is a Python library of useful tools for everyday data-science tasks. It contains many utilities for data-science and machine-learning work, such as:
- Feature selection
- Feature extraction
- Visualization
- Ensembles
and many more.
This article shows how to implement a Stacking Classifier on a classification dataset.
Why stacking?
Many machine-learning and data-science competitions are won with stacked models, which can improve on the accuracy achieved by any single model. We usually build stacked models by choosing different algorithms for the first layer of the architecture, because different algorithms capture different patterns in the training data, and combining them can yield better, more accurate results.
Install the libraries on your system:
pip install mlxtend
pip install pandas
pip install -U scikit-learn
Code: Import the required libraries
python3
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_confusion_matrix
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
Code: Load the dataset
Python3
df = pd.read_csv('heart.csv') # loading the dataset
df.head() # viewing top 5 rows of dataset
Output:
Code: Create X and y for training
Python3
# Creating X and y for training
X = df.drop('target', axis = 1)
y = df['target']
Code: Split the data into train and test sets
Python3
# 20% of the dataset is held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
Code: Standardize the data
Python3
# initializing sc object
sc = StandardScaler()
# variables that need to be transformed
var_transform = ['thalach', 'age', 'trestbps', 'oldpeak', 'chol']
X_train[var_transform] = sc.fit_transform(X_train[var_transform]) # standardizing training data
X_test[var_transform] = sc.transform(X_test[var_transform]) # standardizing test data
print(X_train.head())
Output:
Code: Build the first-layer estimators
Python3
KNC = KNeighborsClassifier() # initialising KNeighbors Classifier
NB = GaussianNB() # initialising Naive Bayes
Let us train and evaluate each first-layer estimator on its own, so that we can compare the performance of the stacked model against the individual models.
Code: Train the KNeighborsClassifier
Python3
model_kNeighborsClassifier = KNC.fit(X_train, y_train) # fitting Training Set
pred_knc = model_kNeighborsClassifier.predict(X_test) # Predicting on test dataset
Code: Evaluate the KNeighborsClassifier
Python3
acc_knc = accuracy_score(y_test, pred_knc) # evaluating accuracy score
print('accuracy score of KNeighbors Classifier is:', acc_knc * 100)
Output:
Code: Train the Naive Bayes classifier
Python3
model_NaiveBayes = NB.fit(X_train, y_train)
pred_nb = model_NaiveBayes.predict(X_test)
Code: Evaluate the Naive Bayes classifier
Python3
acc_nb = accuracy_score(y_test, pred_nb)
print('Accuracy of Naive Bayes Classifier:', acc_nb * 100)
Output:
Code: Implement the Stacking Classifier
Python3
lr = LogisticRegression() # defining meta-classifier
clf_stack = StackingClassifier(classifiers=[KNC, NB], meta_classifier=lr, use_probas=True, use_features_in_secondary=True)
- use_probas=True instructs the Stacking Classifier to feed the predicted class probabilities, rather than the predicted class labels, to the meta-classifier.
- use_features_in_secondary=True instructs the Stacking Classifier to pass the original dataset features to the meta-classifier along with the base-model predictions.
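A rough sketch (using scikit-learn and NumPy on illustrative synthetic data) of what the meta-classifier's input looks like under these two settings: the class probabilities from each base model, concatenated with the original features. Exact column ordering inside mlxtend may differ; this only illustrates the shape of the meta-input.

```python
# Build the meta-classifier input by hand: probabilities + raw features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
knc = KNeighborsClassifier().fit(X, y)
nb = GaussianNB().fit(X, y)

# Each base model contributes one probability column per class (2 classes here),
# and use_features_in_secondary appends the 5 original feature columns.
meta_input = np.hstack([knc.predict_proba(X), nb.predict_proba(X), X])
print(meta_input.shape)  # (200, 2 + 2 + 5) = (200, 9)
```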
Code: Train the Stacking Classifier
Python3
model_stack = clf_stack.fit(X_train, y_train) # training of stacked model
pred_stack = model_stack.predict(X_test) # predictions on test data using stacked model
Code: Evaluate the Stacking Classifier
Python3
acc_stack = accuracy_score(y_test, pred_stack) # evaluating accuracy
print('accuracy score of Stacked model:', acc_stack * 100)
Output:
Both of our individual models achieve close to 80% accuracy, while the stacked model reaches close to 84%. By combining the two individual models we obtained a noticeable boost in performance.