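The Validation Set Approach in R Programming
The validation set approach is a cross-validation technique used in machine learning. Cross-validation techniques are commonly used to assess the performance and accuracy of a machine learning model. In the validation set approach, the dataset used to build the model is randomly divided into two parts: a training set and a validation set (also called a test set). The model is trained on the training set, and its accuracy is computed by predicting the target variable for the data points that were held out during training, which make up the validation set. Splitting the data, training the model, and testing it by hand can be tedious, but R offers many libraries and built-in functions that carry out each of these tasks easily and efficiently.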
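Steps Involved in the Validation Set Approach
- Randomly split the dataset in a chosen ratio (70-30 or 80-20 is usually preferred)
- Train the model on the training set
- Apply the resulting model to the validation set
- Compute the model's accuracy from its prediction error, using model performance metrics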
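The four steps above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions rather than a definitive workflow: it uses the built-in `cars` dataset and a simple linear model as stand-ins, and the names `train_idx`, `train_set`, `validation_set`, and `validation_rmse` are chosen here purely for illustration.

```r
# Assumption: `cars` (50 rows, columns speed and dist) stands in for any dataset.
set.seed(1)

# Step 1: random 70-30 split of the row indices
n <- nrow(cars)
train_idx <- sample(n, size = floor(0.7 * n))
train_set <- cars[train_idx, ]
validation_set <- cars[-train_idx, ]

# Step 2: train the model on the training set only
fit <- lm(dist ~ speed, data = train_set)

# Step 3: apply the fitted model to the held-out validation set
pred <- predict(fit, newdata = validation_set)

# Step 4: summarize the prediction error with a performance metric (RMSE here)
validation_rmse <- sqrt(mean((validation_set$dist - pred)^2))
print(validation_rmse)
```

Changing the seed changes which rows land in each set, and with it the reported error.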
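This article walks step by step through implementing the validation set approach as a cross-validation technique for both classification and regression machine learning models.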
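For Classification Machine Learning Models
This type of machine learning model is used when the target variable is categorical, such as positive/negative or diabetic/non-diabetic. The model predicts the class label of the dependent variable. Here, the logistic regression algorithm is used to build the classification model.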
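Step 1: Loading the dataset and other required packages
Before any exploration or manipulation, load all the required libraries and packages. They provide the built-in functions and the dataset used below, and make the whole process easier to carry out.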
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross - validation methods
library(caret)
# package Used to split the data
# used during classification into
# train and test subsets
library(caTools)
# loading package to
# import desired dataset
library(ISLR)
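Each `library()` call above stops with an error if its package is not installed. A small check like the following can report what is missing first. It is only a sketch: the variable names `required_pkgs` and `missing_pkgs` are made up for this example, and the installation line is left commented out so that nothing is installed without the reader choosing to.

```r
# Packages loaded in this step
required_pkgs <- c("tidyverse", "caret", "caTools", "ISLR")

# requireNamespace() returns TRUE or FALSE instead of raising an error
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```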
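Step 2: Exploring the dataset
It is important to know the structure and dimensions of the dataset, because this helps in building a correct model. Also, since this is a classification model, one must know the distinct classes present in the target variable.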
R
# assigning the complete dataset
# Smarket to a variable
dataset <- Smarket[complete.cases(Smarket), ]
# display the dataset with details
# like column name and its data type
# along with values in each row
glimpse(dataset)
# checking values present
# in the Direction column
# of the dataset
table(dataset$Direction)
Output:
Rows: 1,250
Columns: 9
$ Year 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2 -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3 -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4 -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...
> table(dataset$Direction)
Down Up
602 648
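According to the output above, the imported dataset has 1,250 rows and 9 columns.
In addition, the response (target) variable is a binary categorical variable, since the column contains only the values Down and Up. The two class labels occur in a ratio of roughly 1:1 (602 to 648), so the classes are balanced. If the classes were imbalanced, for example with a ratio of 1:2, the class proportions would have to be made approximately equal before training. Several techniques exist for this, such as:
- Down-sampling
- Up-sampling
- Hybrid sampling using SMOTE and ROSE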
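To make the first technique concrete, here is a minimal base-R sketch of down-sampling. It assumes a made-up data frame `toy` with a 1:2 class ratio (not the Smarket data, which is already balanced) and randomly keeps as many rows from each class as the smaller class has. The caret package also offers `downSample()` and `upSample()` for the same purpose; the hand-written version is shown only to make the idea visible.

```r
set.seed(42)

# Hypothetical imbalanced data: 100 "Down" rows and 200 "Up" rows
toy <- data.frame(
  x = rnorm(300),
  label = factor(rep(c("Down", "Up"), times = c(100, 200)))
)

# Size of the smaller class
minority_size <- min(table(toy$label))

# Row indices grouped by class, then a random draw of equal size from each
rows_by_class <- split(seq_len(nrow(toy)), toy$label)
kept_rows <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, minority_size)
}))

balanced <- toy[kept_rows, ]
print(table(balanced$label))
```

Down-sampling discards data from the larger class, so it is usually applied to the training set only, leaving the validation set untouched.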
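Step 3: Building the model and generating the validation set
This step involves randomly splitting the dataset, creating the training and validation sets, and training the model. The implementation is given below.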
R
# setting seed to generate a
# reproducible random sampling
set.seed(100)
# dividing the complete dataset
# into 2 parts having ratio of
# 70% and 30%
spl = sample.split(dataset$Direction, SplitRatio = 0.7)
# selecting that part of dataset
# which belongs to the 70% of the
# dataset divided in previous step
train = subset(dataset, spl == TRUE)
# selecting that part of dataset
# which belongs to the 30% of the
# dataset divided in previous step
test = subset(dataset, spl == FALSE)
# checking number of rows and column
# in training and testing dataset
print(dim(train))
print(dim(test))
# Building the model
# training the model by assigning Direction column
# as target variable and rest other columns
# as independent variables
model_glm = glm(Direction ~ . , family = "binomial",
data = train, maxit = 100)
Output:
> print(dim(train))
[1] 875 9
> print(dim(test))
[1] 375 9
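`sample.split()` keeps the relative proportions of the class labels roughly the same in both parts. The base-R sketch below illustrates that idea of a stratified split. It is not the caTools implementation: `stratified_split` is a hypothetical helper written for this example, and the labels are simulated with the same class counts as `Direction` (602 Down, 648 Up).

```r
# Hypothetical helper: returns a logical vector, TRUE for training rows,
# drawing the same fraction of rows from every class.
stratified_split <- function(labels, ratio) {
  rows_by_class <- split(seq_along(labels), labels)
  train_rows <- lapply(rows_by_class, function(rows) {
    sample(rows, size = round(ratio * length(rows)))
  })
  is_train <- rep(FALSE, length(labels))
  is_train[unlist(train_rows)] <- TRUE
  is_train
}

set.seed(100)
# Simulated labels with the class counts reported by table(dataset$Direction)
labels <- factor(rep(c("Down", "Up"), times = c(602, 648)))
is_train <- stratified_split(labels, ratio = 0.7)

# Class proportions are nearly identical in the two parts
print(prop.table(table(labels[is_train])))
print(prop.table(table(labels[!is_train])))
```

With these counts the split yields 875 training rows and 375 validation rows, matching the dimensions printed above.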
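Step 4: Predicting the target variable
With the model trained, it is time to make predictions on unseen data. Because the target variable has only two possible values, it is best to call predict() with type = "response", so that the model returns, for each observation, a predicted probability between 0 and 1 (here, the probability of the class Up).
The next step converts these probabilities into a factor of class labels: a data point whose probability is at or above a chosen threshold is labelled Up, and one below the threshold is labelled Down. Here, the probability cutoff is set to 0.5. The code below implements these steps.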
R
# predictions on the validation set
predictTest = predict(model_glm, newdata = test,
type = "response")
# assigning the probability cutoff as 0.5
predicted_classes <- as.factor(ifelse(predictTest >= 0.5,
"Up", "Down"))
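Two details of the thresholding step are easy to miss. For a factor response, `glm()` with the binomial family treats the first level (Down) as failure and models the probability of the other level (Up). And `as.factor()` only creates levels for labels that actually occur, so if every prediction happened to fall on one side of the cutoff the result would have a single level. The sketch below, using made-up probabilities, fixes the levels explicitly; the names `probs` and `class_levels` are illustrative.

```r
# Hypothetical predicted probabilities of the class "Up"
probs <- c(0.91, 0.35, 0.50, 0.08)

# Levels in the same order as the response factor (Down first, Up second)
class_levels <- c("Down", "Up")

# Apply the 0.5 cutoff and pin the factor levels
predicted <- factor(ifelse(probs >= 0.5, "Up", "Down"),
                    levels = class_levels)
print(predicted)

# Even if every probability exceeds the cutoff, both levels remain defined
all_up <- factor(ifelse(c(0.8, 0.9) >= 0.5, "Up", "Down"),
                 levels = class_levels)
print(levels(all_up))
```

Keeping the levels identical to those of the reference column avoids mismatches when the predictions are later compared with the actual labels.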
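Step 5: Evaluating the accuracy of the model
A standard way to judge the accuracy of a classification model is the confusion matrix. It tabulates how many data points were predicted correctly and incorrectly, by comparing the predictions against the actual values of the target variable in the test set. Along with the confusion matrix, other statistics of the model such as accuracy and kappa can be computed with the code below.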
R
# generating confusion matrix and
# other details from the
# prediction made by the model
print(confusionMatrix(predicted_classes, test$Direction))
Output:
Confusion Matrix and Statistics
Reference
Prediction Down Up
Down 177 5
Up 4 189
Accuracy : 0.976
95% CI : (0.9549, 0.989)
No Information Rate : 0.5173
P-Value [Acc > NIR] : <2e-16
Kappa : 0.952
Mcnemar's Test P-Value : 1
Sensitivity : 0.9779
Specificity : 0.9742
Pos Pred Value : 0.9725
Neg Pred Value : 0.9793
Prevalence : 0.4827
Detection Rate : 0.4720
Detection Prevalence : 0.4853
Balanced Accuracy : 0.9761
'Positive' Class : Down
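The headline statistics in this output follow directly from the four counts in the matrix. As a check, the sketch below re-derives accuracy, sensitivity, specificity, and Cohen's kappa in base R from the counts printed above, treating Down as the positive class as the output states. The variable names are chosen for this example.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(177, 4, 5, 189), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference = c("Down", "Up")))

# Accuracy: share of correct predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity and specificity with "Down" as the positive class
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])

# Cohen's kappa: agreement beyond what chance alone would produce
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

An accuracy this high deserves a second look. The formula `Direction ~ .` includes `Today` among the predictors, and in Smarket `Direction` records whether the return in `Today` was positive or negative, so the model is effectively shown the answer. Fitting on the lag variables and `Volume` only, for example `Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume`, should give a more honest estimate; that variant was not run here, so no figures are quoted for it.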
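For Regression Machine Learning Models
Regression models are used to predict a quantity that is continuous in nature, such as the price of a house or the sales of a product. In a regression problem the target variable is usually a real number, such as an integer or floating-point value. The accuracy of such a model is computed by averaging the errors made when predicting the output for the various data points. Below are the steps to implement the validation set approach for a linear regression model.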
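Step 1: Loading the dataset and required packages
R ships with a variety of datasets. Here we use trees, a built-in dataset that is well suited to a linear regression model. The code below loads the dataset and the packages needed for the operations involved in building the model.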
R
# loading required packages
# package to perform data manipulation
# and visualization
library(tidyverse)
# package to compute
# cross - validation methods
library(caret)
# access the data from R’s datasets package
data(trees)
# look at the first several rows of the data
head(trees)
Output:
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
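This dataset has 3 columns in total, of which Volume is the target variable. Since the variable is continuous, the linear regression algorithm can be used to predict it.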
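Before fitting, it can help to confirm the size of the dataset and how strongly each predictor relates to `Volume`. The following is a brief exploratory sketch using only base R; the name `cor_with_volume` is made up for this example.

```r
# Structure: number of rows, column names and types
str(trees)

# Five-number summaries of each column
print(summary(trees))

# Pairwise correlation of each predictor with the target
cor_with_volume <- cor(trees)[c("Girth", "Height"), "Volume"]
print(cor_with_volume)
```

With only 31 rows, an 80-20 split leaves just a handful of observations for validation, so the resulting error estimate should be read with some caution.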
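Step 2: Building the model and generating the validation set
In this step, the dataset is randomly split in an 80-20 ratio. 80% of the data points are used to train the model, and the remaining 20% serve as the validation set, which gives us the model's accuracy. The code is given below.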
R
# reproducible random sampling
set.seed(123)
# creating training data as 80% of the dataset
random_sample <- createDataPartition(trees $ Volume,
p = 0.8, list = FALSE)
# generating training dataset
# from the random_sample
training_dataset <- trees[random_sample, ]
# generating testing dataset
# from rows which are not
# included in random_sample
testing_dataset <- trees[-random_sample, ]
# Building the model
# training the model by assigning Volume column
# as target variable and rest other columns
# as independent variables
model <- lm(Volume ~., data = training_dataset)
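Step 3: Predicting the target variable
Once the model has been built and trained, it is used to predict the target variable for the data points in the validation set.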
R
# predicting the target variable
predictions <- predict(model, testing_dataset)
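For a linear model, `predict()` can also return an interval around each point prediction, which gives a sense of how uncertain the individual predictions are. The sketch below is self-contained and makes its own assumptions: it draws a split with base-R `sample()` rather than `createDataPartition()`, so the rows differ from those used above, and the names `train_idx`, `fit`, and `pred_with_interval` are illustrative.

```r
set.seed(123)

# Assumption: a plain random split of 25 training rows and 6 validation rows
train_idx <- sample(nrow(trees), size = 25)
fit <- lm(Volume ~ ., data = trees[train_idx, ])

# interval = "prediction" adds lower and upper bounds for each new observation
pred_with_interval <- predict(fit, newdata = trees[-train_idx, ],
                              interval = "prediction", level = 0.95)
print(pred_with_interval)
```

The result is a matrix with columns `fit`, `lwr`, and `upr`; the `fit` column holds the same values that a plain `predict(fit, newdata)` call would return.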
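Step 4: Evaluating the accuracy of the model
The statistical metrics used to evaluate the performance of a linear regression model are the root mean squared error (RMSE), the mean absolute error (MAE), and the R² score (coefficient of determination). RMSE and MAE measure prediction error, so lower values are better. R² measures how closely the predictions track the observed values, so a higher value indicates a better model. The code below computes these metrics for the model's predictions.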
R
# computing model performance metrics
data.frame(R2 = R2(predictions, testing_dataset $ Volume),
RMSE = RMSE(predictions, testing_dataset $ Volume),
MAE = MAE(predictions, testing_dataset $ Volume))
Output:
R2 RMSE MAE
1 0.9564487 5.274129 4.73567
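The three metrics are simple to compute by hand, which makes clear what each one measures. The helpers below (`rmse`, `mae`, `r2_corr`, `r2_traditional`) are hypothetical names written for this sketch, applied to made-up numbers rather than the predictions above. Two definitions of R² are shown because they can differ: to the best of my knowledge caret's `R2()` defaults to the squared correlation and offers the 1 - SSE/SST form through `formula = "traditional"`, but this is worth confirming in the package documentation.

```r
# Root mean squared error: penalizes large errors more heavily
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

# Mean absolute error: average size of the errors
mae <- function(pred, obs) mean(abs(obs - pred))

# R squared as the squared correlation between predictions and observations
r2_corr <- function(pred, obs) cor(pred, obs)^2

# R squared as 1 - SSE / SST (can be negative for very poor predictions)
r2_traditional <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Made-up observed and predicted values
obs <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)

print(data.frame(R2_corr = r2_corr(pred, obs),
                 R2_traditional = r2_traditional(pred, obs),
                 RMSE = rmse(pred, obs),
                 MAE = mae(pred, obs)))
```

RMSE and MAE are expressed in the units of the target variable, so they are only comparable between models that predict the same quantity.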
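Advantages of the Validation Set Approach
- It is one of the most basic and simple techniques for evaluating a model.
- It involves no complex implementation steps.
Disadvantages of the Validation Set Approach
- The predictions made by the model, and hence the estimated error, depend heavily on which subset of observations is used for training and which for validation.
- Because only a subset of the data is used for training, the model may be biased.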
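The first disadvantage is easy to observe by repeating the split with different seeds and comparing the resulting error. The sketch below does this for the `trees` data in base R. `split_rmse` is a hypothetical helper written for this example, and it uses a plain `sample()` split rather than `createDataPartition()`.

```r
# Hypothetical helper: one 80-20 split, one fitted model, one validation RMSE
split_rmse <- function(seed, data = trees, train_frac = 0.8) {
  set.seed(seed)
  train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit <- lm(Volume ~ ., data = data[train_idx, ])
  pred <- predict(fit, newdata = data[-train_idx, ])
  sqrt(mean((data$Volume[-train_idx] - pred)^2))
}

# Repeat the whole procedure for 20 different random splits
rmse_by_seed <- vapply(1:20, split_rmse, numeric(1))

# The spread of these values shows how much the estimate depends on the split
print(summary(rmse_by_seed))
```

Averaging over many splits, or moving to k-fold cross-validation, is the usual way to reduce this dependence on a single random partition.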