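Machine Learning Workflow Using PyCaret

PyCaret is an open-source, easy-to-use machine learning library. It supports you from data preparation through model analysis and deployment. It is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn and spaCy, and it also gives access to more complex machine learning algorithms that are tedious to implement and tune by hand.

So why use PyCaret? There are many reasons; here are a few. First, PyCaret is a low-code library that makes you more productive when solving business problems. Second, PyCaret handles data preprocessing and feature engineering in a single line of code, work that is otherwise very time-consuming. Third, PyCaret lets you compare different machine learning models and fine-tune them with little effort. There are other advantages too, but these will do for now.

Installation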

pip install pycaret

If you are using Azure Notebooks or Google Colab:

!pip install pycaret

In this article we use PyCaret on the Iris classification dataset, which you can download here: https://archive.ics.uci.edu/ml/datasets/iris

Let's start by importing the required libraries.

Python3

# importing required libraries
# for reading and manipulating data
import numpy as np
import pandas as pd

Reading the dataset using the pandas library

Python3

# reading the data from csv file
iris_classification = pd.read_csv('Iris.csv')
 
# viewing top 5 rows of data
iris_classification.head(5)

Output:

Getting started with PyCaret

Initializing the setup

Python3

# import the classification module from pycaret
from pycaret.classification import *

# initialize the setup
clf = setup(iris_classification, target = 'Species')

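setup takes our data, iris_classification, and the target to predict, which in our case is Species.

Output:

Condensed output

It gives a basic description of our dataset, and you can see that it automatically encodes the target variable as 0, 1, 2.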
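To make the 0, 1, 2 encoding concrete, here is a minimal sketch of label encoding in plain Python. The species strings are hypothetical sample values, and numbering the classes in sorted order is an assumption borrowed from how scikit-learn's LabelEncoder behaves; the mapping PyCaret actually used is the one reported in the setup output.

```python
# Hypothetical labels as they might appear in the 'Species' column.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# Assumption: classes are numbered in sorted order (LabelEncoder-style).
classes = sorted(set(species))
label_to_id = {label: idx for idx, label in enumerate(classes)}

# Replace every string label with its integer code.
encoded = [label_to_id[label] for label in species]

print(label_to_id)
print(encoded)
```

Keeping the mapping around is useful later, because predictions come back as integer codes and usually need to be translated back into species names.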

Now let's compare the various classification models PyCaret builds for us.

Python3

# comparing different
# classification models
compare_models()

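Output:

As we can see here, the highest value in each column is highlighted. For this classification task, both Quadratic Discriminant Analysis and the Ada Boost Classifier perform well, so let's use QDA for further model creation and analysis.

Creating the model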

Python3

# creating model qda
model = create_model('qda')

Output:

It shows the metrics used to evaluate the model on each fold.
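The per-fold rows come from k-fold cross-validation: the training data is split into k parts, and each part takes one turn as the validation set while the rest is used for fitting. The sketch below shows only the index bookkeeping in plain Python. The helper name k_fold_indices and the sample count of 105 are illustrative assumptions, and the sketch neither shuffles nor stratifies, which PyCaret's own splitter may well do.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


# Hypothetical sizes: 105 training rows split into 10 folds.
splits = k_fold_indices(n_samples=105, n_folds=10)
for number, (train, validation) in enumerate(splits):
    print(f"fold {number}: train={len(train)} validation={len(validation)}")
```

Each row of the metrics table then corresponds to one such pair: the model is fitted on the train indices and scored on the validation indices.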

Let's tune the model's hyperparameters.

Python3

# tuning model hyperparameters
# (PyCaret 2.0+ expects the trained model object here, not the 'qda' string ID)
tuned_model = tune_model(model)

Output:

We can see here that tuning the model improved some of the Recall, Precision, F1, and Kappa scores.
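Of the metrics named above, Kappa is probably the least familiar. As a rough sketch, Cohen's kappa compares the observed agreement between true and predicted labels with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The function name and the label lists below are made up for illustration and are not results from the Iris run.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Note: undefined (division by zero) when expected == 1, e.g. one class only.
    return (observed - expected) / (1 - expected)


# Made-up labels: five of six predictions are correct.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(cohen_kappa(y_true, y_pred))
```

A kappa of 1 means perfect agreement, while a value near 0 means the model does no better than chance given the class frequencies.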

Now let's do some model analysis.

Python3

# plotting boundaries between different
# labels
plot_model(tuned_model, plot = 'boundary')

Output:

Python3

# plotting the confusion matrix for predicted labels
plot_model(tuned_model, plot = 'confusion_matrix')

Output:
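For readers who want to see what a confusion matrix plot is built from, here is a minimal plain-Python sketch. It assumes integer-encoded labels and puts true classes on the rows and predicted classes on the columns, the convention scikit-learn uses; the labels are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each true class (row) was predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels: one sample of class 1 is mistaken for class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
for row in confusion_matrix(y_true, y_pred, n_classes=3):
    print(row)
```

Correct predictions sit on the diagonal, so every off-diagonal count points at a specific pair of classes the model confuses.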

Python3

# plotting number of correctly
# classified and misclassified labels
plot_model(tuned_model, plot = 'error')

Output:

Python3

# plotting classification report
plot_model(tuned_model, plot = 'class_report')

Output:
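A classification report lists precision, recall, and F1 for each class. The sketch below computes these three numbers by hand in plain Python, under the assumption that labels are integer-encoded; the function name per_class_report and the data are illustrative only.

```python
def per_class_report(y_true, y_pred, labels):
    """Return {label: {'precision', 'recall', 'f1'}} for each label."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Fall back to 0.0 when a denominator is zero (class never predicted/seen).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up labels: one sample of class 1 is mistaken for class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
report = per_class_report(y_true, y_pred, labels=[0, 1, 2])
for label, scores in report.items():
    print(label, scores)
```

In this toy data, class 1 keeps perfect precision but loses recall, while class 2 keeps perfect recall but loses precision, which is the kind of asymmetry the report makes visible.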

Finalizing the model

Python3

# finalizing the tuned_model
finalize_model(tuned_model)

Output:

Saving the model

Python3

# saving the model
save_model(tuned_model, 'qda1')

Output: