Cross-Validation in R Programming
The main challenge in designing a machine learning model is making it work accurately on unseen data. To know whether a model works correctly, we must test it against data points that were not present during its training. These data points serve as unseen data and make it easy to evaluate the model's accuracy. One of the best techniques for checking the effectiveness of a machine learning model is cross-validation, which can be implemented easily in the R programming language. In this approach, a portion of the dataset is reserved and not used to train the model. Once the model is ready, that reserved dataset is used for testing. During the testing phase the values of the dependent variable are predicted, and model accuracy is calculated from the prediction error, i.e., the difference between the actual and predicted values of the dependent variable. Several statistical metrics are available for evaluating the accuracy of a regression model (a short sketch computing all three follows this list):
- Root Mean Squared Error (RMSE): As the name suggests, it is the square root of the mean squared difference between the actual and predicted values of the target variable. It represents the average prediction error made by the model, so a lower RMSE value indicates a more accurate model.
- Mean Absolute Error (MAE): This metric gives the absolute difference between the actual values of the target variable and the values predicted by the model. If outliers are not a major concern for the model's accuracy, MAE can be used to evaluate its performance. As with RMSE, a better model has a smaller MAE.
- R² (R-squared) error: The value of the R-squared metric indicates what percentage of the variance in the dependent variable is explained collectively by the independent variables. In other words, it reflects the strength of the relationship between the target variable and the model on a 0 to 100% scale, so a better model has a higher R-squared value.
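As a quick illustration of these three metrics, here is a minimal base-R sketch that computes them from scratch. The vectors actual and predicted hold hypothetical values introduced only for this example; also note that caret's R2() helper used later computes R-squared as a squared correlation by default, so its value may differ slightly from the sum-of-squares formula shown here.
R
# hypothetical observed and predicted values of a target variable
actual    <- c(26.52, 12.48, 11.16, 22.20)
predicted <- c(25.90, 13.10, 10.80, 21.75)
# RMSE: square root of the mean squared difference
rmse <- sqrt(mean((actual - predicted)^2))
# MAE: mean of the absolute differences
mae <- mean(abs(actual - predicted))
# R-squared: 1 - residual sum of squares / total sum of squares
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
c(RMSE = rmse, MAE = mae, R2 = r2)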
Types of Cross-Validation
While splitting the complete dataset into a training set and a validation set, some important and critical data points may be lost for training purposes. Since those points are not included in the training set, the model never gets the chance to detect certain patterns, which can lead to overfitting or underfitting. To avoid this, there are different types of cross-validation techniques that guarantee random sampling of the training and validation datasets and maximize the accuracy of the model. Some of the most popular cross-validation techniques are:
- Validation set approach
- Leave-one-out cross-validation (LOOCV)
- K-fold cross-validation
- Repeated K-fold cross-validation
Loading the Dataset
To implement linear regression, we use the marketing dataset, which comes with the datarium package for R (it is not part of base R). Below is the code to import this dataset into your R programming environment.
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross-validation methods
library(caret)
# installing the package that
# provides the marketing dataset
# (only needs to run once)
install.packages("datarium")
# loading the dataset
data("marketing", package = "datarium")
# inspecting the dataset
head(marketing)
Output:
youtube facebook newspaper sales
1 276.12 45.36 83.04 26.52
2 53.40 47.16 54.12 12.48
3 20.64 55.08 83.16 11.16
4 181.80 49.56 70.20 22.20
5 216.96 12.96 70.08 15.48
6 10.44 58.68 90.00 8.64
Validation Set Approach (or Data Splitting)
In this method, the dataset is randomly divided into a training set and a testing set. The following steps are performed to implement this technique:
- Randomly sample the dataset
- Train the model on the training dataset
- Apply the resulting model to the testing dataset
- Calculate the prediction error using the model performance metrics
Below is the implementation of this method:
R
# R program to implement
# validation set approach
# setting seed to generate a
# reproducible random sampling
set.seed(123)
# creating training data as 80% of the dataset
random_sample <- createDataPartition(marketing$sales,
                                     p = 0.8, list = FALSE)
# generating training dataset
# from the random_sample
training_dataset <- marketing[random_sample, ]
# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- marketing[-random_sample, ]
# Building the model
# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- lm(sales ~ ., data = training_dataset)
# predicting the target variable
predictions <- predict(model, testing_dataset)
# computing model performance metrics
data.frame(R2   = R2(predictions, testing_dataset$sales),
           RMSE = RMSE(predictions, testing_dataset$sales),
           MAE  = MAE(predictions, testing_dataset$sales))
Output:
R2 RMSE MAE
1 0.9049049 1.965508 1.433609
Advantages:
- One of the most basic and simplest techniques for evaluating a model.
- No complicated steps for implementation.
Disadvantages:
- Predictions made by the model depend heavily on the subset of observations used for training and validation (see the sketch after this list).
- Using only one subset of the data for training can make the model biased.
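To see that dependence on the split, here is a quick sketch that reruns the same partition-fit-evaluate steps with a different seed (42 is an arbitrary choice for illustration); it assumes caret and the marketing dataset are already loaded, and the resulting metrics will differ from those reported above.
R
# same steps as above, different random split
set.seed(42)
random_sample <- createDataPartition(marketing$sales,
                                     p = 0.8, list = FALSE)
training_dataset <- marketing[random_sample, ]
testing_dataset <- marketing[-random_sample, ]
model <- lm(sales ~ ., data = training_dataset)
predictions <- predict(model, testing_dataset)
# these values will not match the ones obtained with seed 123
data.frame(R2   = R2(predictions, testing_dataset$sales),
           RMSE = RMSE(predictions, testing_dataset$sales),
           MAE  = MAE(predictions, testing_dataset$sales))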
Leave-One-Out Cross-Validation (LOOCV)
This method also splits the dataset into two parts, but it overcomes the drawbacks of the validation set approach. LOOCV carries out cross-validation in the following way:
- Train the model on N - 1 data points
- Test the model against the single data point left out in the previous step
- Calculate the prediction error
- Repeat the above three steps until the model has been trained and tested on every data point
- Generate the overall prediction error by taking the average of the prediction errors from every case
Below is the implementation of this method:
R
# R program to implement
# Leave one out cross validation
# defining training control
# as Leave One Out Cross Validation
train_control <- trainControl(method = "LOOCV")
# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)
# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...
Resampling results:
RMSE Rsquared MAE
2.059984 0.8912074 1.539441
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
- A less biased model, since almost every data point is used for training.
- No randomness in the values of the performance metrics, because every data point is held out for testing exactly once.
Disadvantages:
- Training the model N times leads to expensive computation time if the dataset is large.
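To make the loop described above concrete, here is a minimal base-R sketch of what LOOCV does internally; caret's trainControl(method = "LOOCV") automates this, so the sketch is for illustration only and assumes the marketing dataset is already loaded.
R
n <- nrow(marketing)
predictions <- numeric(n)
for (i in 1:n) {
  # train on the N - 1 remaining rows
  fit <- lm(sales ~ ., data = marketing[-i, ])
  # predict the single held-out row
  predictions[i] <- predict(fit, marketing[i, ])
}
# overall prediction error (RMSE), averaged over all N points
sqrt(mean((marketing$sales - predictions)^2))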
K-Fold Cross-Validation
This cross-validation technique divides the data into K subsets (folds) of almost equal size. Out of these K folds, one subset is used as the validation set while the rest take part in training the model. The complete working procedure of this method is as follows:
- Randomly split the dataset into K subsets
- Train the model on K - 1 of the subsets
- Test the model against the one subset left out in the previous step
- Repeat the above steps K times, i.e., until the model has been trained and tested on every subset
- Generate the overall prediction error by taking the average of the prediction errors from every case
Below is the implementation of this method:
R
# R program to implement
# K-fold cross-validation
# setting seed to generate a
# reproducible random sampling
set.seed(125)
# defining training control
# as cross-validation and
# value of K equal to 10
train_control <- trainControl(method = "cv",
number = 10)
# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)
# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:
RMSE Rsquared MAE
2.027409 0.9041909 1.539866
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
- Fast computation speed.
- A very effective method for estimating the prediction error and the accuracy of a model.
Disadvantages:
- A lower value of K leads to a biased model, whereas a higher value of K introduces variability in the model's performance metrics. It is therefore important to use the right value of K for the model (K = 5 and K = 10 are generally advisable).
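The fold construction itself can be made explicit with caret's createFolds(), which returns a list of held-out row indices; train() performs an equivalent loop internally, so the following sketch only illustrates the mechanics and assumes the marketing dataset is already loaded.
R
# split the rows into K = 10 roughly equal held-out sets
set.seed(125)
folds <- createFolds(marketing$sales, k = 10)
fold_rmse <- sapply(folds, function(test_idx) {
  # train on the other K - 1 folds
  fit <- lm(sales ~ ., data = marketing[-test_idx, ])
  # test on the held-out fold
  preds <- predict(fit, marketing[test_idx, ])
  sqrt(mean((marketing$sales[test_idx] - preds)^2))
})
# average prediction error across the K folds
mean(fold_rmse)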
Repeated K-Fold Cross-Validation
As the name suggests, in this method the K-fold cross-validation algorithm is repeated a certain number of times. Below is the implementation of this method:
R
# R program to implement
# repeated K-fold cross-validation
# setting seed to generate a
# reproducible random sampling
set.seed(125)
# defining training control as
# repeated cross-validation and
# value of K is 10 and repetition is 3 times
train_control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# training the model by assigning sales column
# as target variable and all other columns
# as independent variables
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)
# printing model performance metrics
# along with other details
print(model)
Output:
Linear Regression
200 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:
RMSE Rsquared MAE
2.020061 0.9038559 1.541517
Tuning parameter 'intercept' was held constant at a value of TRUE
Advantages:
- In each repetition the data sample is shuffled, resulting in different splits of the sample data.
Disadvantages:
- With each repetition the algorithm has to train the model from scratch, so the computation time to evaluate the model increases with the number of repetitions.
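The variability across those repeated splits can be inspected on the fitted object itself: by default caret stores one row of metrics per resample (10 folds x 3 repeats = 30 rows here) in model$resample, which is one way to judge how stable the estimates are. A small sketch, assuming the model object from the code above:
R
# one row of metrics per resample, e.g. "Fold01.Rep1"
head(model$resample)
# spread of RMSE across all 30 resamples
summary(model$resample$RMSE)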
Note: For both regression and classification machine learning models, repeated K-fold cross-validation is generally the most preferred technique.