K-fold Cross-Validation in R Programming
The primary purpose of any machine learning model is to predict the outcome of real-time data. To check whether the developed model is good enough to predict the outcome of unseen data points, performance evaluation of the applied machine learning model becomes essential. The K-fold cross-validation technique is a method of resampling the dataset in order to evaluate a machine learning model. In this technique, the parameter K refers to the number of different subsets that the given dataset is split into. K − 1 of the subsets are used to train the model, and the remaining subset is used as the validation set.
Steps involved in K-fold cross-validation in R:
- Randomly split the dataset into K subsets
- For each subset of the data:
  - Treat that subset as the validation set
  - Use all the remaining subsets for training
  - Train the model and evaluate it on the validation (test) set
  - Compute the prediction error
- Repeat the above steps K times, i.e. until the model has been trained and validated on every subset
- Generate the overall prediction error by taking the average of the prediction errors from each iteration
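Before turning to caret's built-in support, the loop above can be sketched in a few lines of base R. This is a minimal illustration only; the data frame `df` and its numeric target column `y` are hypothetical names, and a linear model stands in for whatever model is being validated:

```r
# minimal base-R sketch of K-fold cross-validation for a linear
# model; `df` is a hypothetical data frame whose numeric target
# column is named `y`
k_fold_cv <- function(df, k = 5) {
  # randomly assign each row to one of the k folds
  folds <- sample(rep(1:k, length.out = nrow(df)))
  errors <- numeric(k)
  for (i in 1:k) {
    train_set <- df[folds != i, ]  # K-1 subsets for training
    valid_set <- df[folds == i, ]  # held-out subset for validation
    fit  <- lm(y ~ ., data = train_set)
    pred <- predict(fit, newdata = valid_set)
    errors[i] <- sqrt(mean((valid_set$y - pred)^2))  # per-fold RMSE
  }
  mean(errors)  # overall error = average across the K folds
}
```

Calling set.seed() before k_fold_cv() makes the random fold assignment reproducible, exactly as the caret examples below do.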
To implement all the steps involved in the K-fold method, the R language has rich libraries and packages of built-in functions through which the complete task can be performed easily. Below is the step-by-step procedure for implementing the K-fold technique as a cross-validation method on classification and regression machine learning models.
Implementing the K-fold Technique in Classification
Classification machine learning models are preferred when the target variable consists of categorical values, such as spam/not spam or true/false. Here the Naive Bayes classifier will be used as a probabilistic classifier to predict the class label of the target variable.
Step 1: Loading the dataset and other required packages
The first requirement is to set up the R environment by loading all the required libraries and packages, so that the complete process can be carried out without any failure. Below is the implementation of this step.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross - validation methods
library(caret)
# loading package to
# import desired dataset
library(ISLR)
Step 2: Exploring the dataset
Before operating on a dataset, it is essential to inspect it first. This gives a clear picture of its structure and of the various data types present in it. For this, the dataset must be assigned to a variable. Below is the code to do so.
R
# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ]
# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)
# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year      <dbl> …
$ Lag1      <dbl> …
$ Lag2      <dbl> …
$ Lag3      <dbl> …
$ Lag4      <dbl> …
$ Lag5      <dbl> …
$ Volume    <dbl> …
$ Today     <dbl> …
$ Direction <fct> …
> table(dataset$Direction)
Down   Up
 602  648
From the above information, the dataset contains 1,250 rows and 9 columns. The independent variables are of <dbl> data type, short for double, meaning double-precision floating-point numbers. The target variable Direction is of <fct> (factor) type with two classes, Down and Up. Their counts (602 vs 648) are roughly balanced here; if they were not, the dataset would need to be balanced before training. For this, there are several techniques, such as:
- Down sampling
- Up sampling
- Hybrid sampling using SMOTE and ROSE
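In caret, these balancing strategies do not have to be applied by hand: trainControl() accepts a sampling argument that resamples within each cross-validation fold. A sketch of the four options follows; note that "smote" and "rose" additionally require the DMwR and ROSE packages to be installed:

```r
library(caret)

# down-sample the majority class within each fold
ctrl_down  <- trainControl(method = "cv", number = 10, sampling = "down")

# up-sample the minority class within each fold
ctrl_up    <- trainControl(method = "cv", number = 10, sampling = "up")

# hybrid methods (the DMwR / ROSE packages must be installed)
ctrl_smote <- trainControl(method = "cv", number = 10, sampling = "smote")
ctrl_rose  <- trainControl(method = "cv", number = 10, sampling = "rose")
```

Any of these control objects can then be passed to train() via trControl, exactly as in Step 3 below.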
Step 3: Building the model with the K-fold algorithm
In this step, the trainControl() function is defined to set the value of the K parameter, and then the model is developed following the steps involved in the K-fold technique. Below is the implementation.
R
# setting seed to generate a
# reproducible random sampling
set.seed(123)
# define training control which
# generates parameters that further
# control how models are created
train_control <- trainControl(method = "cv",
                              number = 10)
# building the model and
# predicting the target variable
# as per the Naive Bayes classifier
model <- train(Direction ~ ., data = dataset,
               trControl = train_control,
               method = "nb")
Step 4: Evaluating the accuracy of the model
After training and validating the model, it is time to calculate its overall accuracy. Below is the code to generate the model summary.
R
# summarize results of the
# model after calculating
# prediction error in each case
print(model)
Output:
Naive Bayes
1250 samples
8 predictor
2 classes: 'Down', 'Up'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1125, 1125, 1125, 1126, 1125, 1124, …
Resampling results across tuning parameters:
usekernel Accuracy Kappa
FALSE 0.9543996 0.9083514
TRUE 0.9711870 0.9422498
Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.
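caret also retains the per-fold results, so the accuracy obtained on each of the 10 validation folds (rather than just their average) can be inspected via the fitted object's resample table. The sketch below refits a small self-contained two-class model so it can run on its own; the article's `model` from Step 3 can be inspected in exactly the same way:

```r
library(caret)

# small self-contained two-class fit (stands in for the
# Naive Bayes `model` built in Step 3)
set.seed(123)
two_class <- droplevels(iris[iris$Species != "setosa", ])
model <- train(Species ~ ., data = two_class, method = "glm",
               trControl = trainControl(method = "cv", number = 10))

# accuracy and kappa achieved on each of the 10 validation folds
model$resample

# averaging the per-fold accuracies reproduces the value
# reported by print(model)
mean(model$resample$Accuracy)
```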
Implementing the K-fold Technique in Regression
Regression machine learning models are used to predict target variables of a continuous nature, such as the price of a commodity or the sales of a firm. Below are the complete steps to implement the K-fold cross-validation technique on a regression model.
Step 1: Importing all required packages
Set up the R environment by importing all the necessary packages and libraries. Below is the implementation of this step.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross - validation methods
library(caret)
# installing package to
# import desired dataset
install.packages("datarium")
Step 2: Loading and inspecting the dataset
In this step, the required dataset is loaded into the R environment. After that, a few rows of the dataset are printed in order to understand its structure. Below is the code to carry out this task.
R
# loading the dataset
data("marketing", package = "datarium")
# inspecting the dataset
head(marketing)
Output:
youtube facebook newspaper sales
1 276.12 45.36 83.04 26.52
2 53.40 47.16 54.12 12.48
3 20.64 55.08 83.16 11.16
4 181.80 49.56 70.20 22.20
5 216.96 12.96 70.08 15.48
6 10.44 58.68 90.00 8.64
Step 3: Building the model with the K-fold algorithm
The value of the K parameter is defined in the trainControl() function, and the model is developed according to the steps mentioned in the K-fold cross-validation algorithm. Below is the implementation.
R
# setting seed to generate a
# reproducible random sampling
set.seed(125)
# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
                              number = 10)
# training the model with the sales column as the
# target variable and all other columns as
# independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)
Step 4: Evaluating the model's performance
As mentioned in the K-fold algorithm, the model is tested against every unique fold (or subset) of the dataset, the prediction error is calculated in each case, and finally the mean of all the prediction errors is treated as the final performance score of the model. Below is the code to print the final score and overall summary of the model.
R
# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, …
Resampling results:
RMSE Rsquared MAE
2.027409 0.9041909 1.539866
Tuning parameter 'intercept' was held constant at a value of TRUE
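The three reported metrics are averages of the per-fold prediction errors. On a single validation fold with observed values y and predictions y_hat, they would be computed as follows (illustrative helper functions, not part of caret):

```r
# prediction-error metrics for one validation fold,
# given observed values y and predictions y_hat
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))  # root mean squared error
mae  <- function(y, y_hat) mean(abs(y - y_hat))       # mean absolute error
r2   <- function(y, y_hat) cor(y, y_hat)^2            # caret's Rsquared (squared correlation)
```

K-fold cross-validation averages each of these quantities over the K validation folds to produce the summary line printed above.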
Advantages of K-fold cross-validation
- Fast computation speed.
- A very effective method to estimate the prediction error and the accuracy of a model.
Disadvantages of K-fold cross-validation
- A lower value of K leads to a biased model, while a higher value of K can lead to variability in the model's performance metrics. It is therefore important to use the right value of K for the model (generally K = 5 or K = 10 is preferable).
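The sensitivity to K can be checked empirically by refitting the same model with different values of the number argument. In the sketch below, the built-in mtcars dataset stands in for the marketing data so the example runs without extra packages; results will vary slightly with the seed:

```r
library(caret)

# compare cross-validated RMSE for two common choices of K
# (mtcars is used here so the sketch needs no extra packages)
for (k in c(5, 10)) {
  set.seed(125)
  ctrl <- trainControl(method = "cv", number = k)
  fit  <- train(mpg ~ ., data = mtcars, method = "lm",
                trControl = ctrl)
  cat("K =", k, " RMSE =", fit$results$RMSE, "\n")
}
```

If the estimates differ substantially across values of K, that is a sign the performance estimate itself is unstable for the chosen model and dataset.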