The Random Forest Approach for Regression in R Programming (1)

Last modified: 2023-12-03 15:34:35.964000. Author: Mango

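Introduction

Random forest is a powerful ensemble learning method built from many decision trees. Each tree is trained on a bootstrap sample of the data, and at each split only a random subset of the features is considered as candidates. For regression, the forest's prediction is the average of the individual trees' predictions. Random forests can be used for both classification and regression.

This article focuses on random forest regression in R.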
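Each tree in a random forest is grown on a bootstrap sample of the rows, and each split looks at only a random subset of the predictors. The base R sketch below illustrates those two sampling steps on mtcars. It is an illustration of the idea only, not the internals of any package; the variable names (in_bag, out_of_bag, candidates) are made up for this example.

```r
# Illustration only: mimics the sampling idea behind a random forest,
# not the actual implementation of any package.
data(mtcars)
set.seed(123)

n <- nrow(mtcars)                            # 32 observations
predictors <- setdiff(names(mtcars), "mpg")  # the 10 candidate predictors

# Bootstrap sample for one tree: draw n row indices with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are "out-of-bag" (OOB) for this tree
out_of_bag <- setdiff(seq_len(n), in_bag)

# Random subset of predictors considered at one split (3 of 10 here)
candidates <- sample(predictors, size = 3)

length(unique(in_bag))   # distinct rows that ended up in the bootstrap sample
length(out_of_bag)       # rows left out of this tree
candidates
```

On average a bit more than a third of the rows are out-of-bag for any given tree, which is what makes out-of-bag error estimates possible without a separate test set.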

Sample data

We will use the built-in mtcars dataset as the example:

data(mtcars)
head(mtcars)

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

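Our task treats mpg as the response (dependent) variable and all other columns as explanatory (independent) variables. We will build a random forest regression model to predict each car's fuel efficiency (mpg).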

Random forest regression model

Installing and loading the randomForest package

First, install and load the randomForest package by running:

install.packages("randomForest")
library(randomForest)

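Building the random forest regression model

We build the model with the randomForest() function. Its main arguments are:

x: matrix or data frame of predictors (a formula plus data can be supplied instead)
y: response vector; a numeric response gives a regression forest
ntree: number of trees to grow
importance: whether to assess variable importance
mtry: number of variables randomly sampled as candidates at each split
replace: whether observations are sampled with replacement
nodesize: minimum size of terminal nodes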
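The list above describes x and y, while the model in this article is fit through the formula interface. As a minimal sketch, the same kind of model can be specified with x and y directly. The object names x, y and rf_xy are chosen for this example, and mtry is left at its default here.

```r
library(randomForest)
data(mtcars)

x <- mtcars[, setdiff(names(mtcars), "mpg")]  # the 10 predictor columns
y <- mtcars$mpg                               # numeric response, so a regression forest

set.seed(123)
rf_xy <- randomForest(x = x, y = y, ntree = 500, importance = TRUE)

rf_xy$type   # "regression"
rf_xy$mtry   # regression default: max(floor(ncol(x) / 3), 1)
```

With 10 predictors the regression default for mtry works out to 3, which matches the value set explicitly in the article's model.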

Run the following to build the random forest regression model:

set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE, mtry = 3, replace = TRUE, nodesize = 5)

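Notes:

set.seed(123) makes the results reproducible
mpg ~ . means that all other variables are used to predict mpg
importance = TRUE requests the variable importance measures
mtry = 3 means that 3 randomly chosen variables are considered as candidates at each split (not per tree)
replace = TRUE samples observations with replacement (the default)
nodesize = 5 requires terminal nodes to hold at least 5 observations (the default for regression), which keeps trees from growing too deep and overfitting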
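Since mtry controls how many predictors compete at each split, one simple way to see its effect is to refit the forest for several values and compare the out-of-bag (OOB) error. The loop below is a rough sketch of that idea, not a tuning procedure prescribed by the article; mtry_values and oob_mse are names invented for the example.

```r
library(randomForest)
data(mtcars)

mtry_values <- 1:10   # mtcars has 10 predictors besides mpg

oob_mse <- sapply(mtry_values, function(m) {
  set.seed(123)       # same seed for every fit so the comparison is repeatable
  fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500,
                      mtry = m, nodesize = 5)
  tail(fit$mse, 1)    # OOB mean squared error after the last tree
})

data.frame(mtry = mtry_values, oob_mse = oob_mse)
mtry_values[which.min(oob_mse)]   # mtry with the lowest OOB MSE in this run
```

Setting mtry to the full number of predictors (10 here) removes the random feature selection, so the forest reduces to bagged trees. With only 32 rows the differences between neighbouring mtry values can be small and depend on the seed.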

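Inspecting the model

summary() has no method specific to randomForest objects, so calling it only lists the components stored in the fitted object, along with their length, class and mode.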

summary(rf_model)

Output (the exact components and their lengths may vary with the package version):

                Length Class  Mode     
call            4      -none- call     
type            1      -none- character
predicted      32      -none- numeric  
mse             1      -none- numeric  
rsq             1      -none- numeric  
r2              1      -none- numeric  
error           1      -none- numeric  
oob.times      32      -none- numeric  
importance     19      -none- numeric  
importanceSD    0      -none- NULL     
localImportance 0      -none- NULL     
proximity       0      -none- NULL     
ntree           1      -none- numeric  
mtry            1      -none- numeric  
forest          4      -none- list     
coefs           0      -none- NULL     
y               0      -none- NULL     
test            0      -none- NULL     
inbag           0      -none- NULL     
terms           3      terms  call
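Because summary() only lists the object's components, it is often more informative to look at a few of them directly. The sketch below refits the article's model so that it is self-contained and then inspects some components; which components are worth inspecting is a choice made for this example.

```r
library(randomForest)
data(mtcars)

set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, ntree = 500,
                         importance = TRUE, mtry = 3,
                         replace = TRUE, nodesize = 5)

names(rf_model)            # components stored in the fitted object
length(rf_model$mse)       # one out-of-bag MSE value per tree
head(rf_model$predicted)   # out-of-bag predictions for the first few cars
plot(rf_model)             # out-of-bag MSE against the number of trees
```

The plot is a quick check on whether ntree is large enough: if the curve has flattened well before the last tree, adding more trees is unlikely to change the error much.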

print() gives a more useful summary of the fitted model.

print(rf_model)

Output:

Call:
 randomForest(formula = mpg ~ ., data = mtcars, ntree = 500, importance = TRUE,      mtry = 3, replace = TRUE, nodesize = 5) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 5.693685
                    % Var explained: 89.44

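The mean of squared residuals is 5.693685 and the percentage of variance explained is 89.44. Both figures are based on out-of-bag predictions: the first is the out-of-bag mean squared error, and the second is the share of the variance in mpg (the response) that the model accounts for, not the variance of the explanatory variables.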
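To make the two printed figures less opaque, the sketch below recomputes them by hand from the out-of-bag predictions. It assumes that the reported values are derived from out-of-bag predictions and that the variance in the denominator uses n rather than n - 1, which is my understanding of the package's behaviour; the exact numbers depend on the package version and seed.

```r
library(randomForest)
data(mtcars)

set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, ntree = 500,
                         importance = TRUE, mtry = 3,
                         replace = TRUE, nodesize = 5)

y <- mtcars$mpg
n <- length(y)

oob_pred <- predict(rf_model)        # no newdata: returns out-of-bag predictions
oob_mse  <- mean((y - oob_pred)^2)   # mean of squared out-of-bag residuals
pct_var  <- 100 * (1 - oob_mse / (var(y) * (n - 1) / n))

c(by_hand = oob_mse, reported = tail(rf_model$mse, 1))
c(by_hand = pct_var, reported = 100 * tail(rf_model$rsq, 1))

# Predicting the training rows themselves gives an optimistic error,
# because every tree has already seen most of these rows
mean((y - predict(rf_model, newdata = mtcars))^2)
```

The last line is there for contrast: the in-sample error should not be read as a measure of how well the model predicts unseen cars.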

Variable importance

We can extract the variable importance measures with the importance() function.

varimp <- importance(rf_model)
varimp <- data.frame(variable = row.names(varimp), importance = varimp[,1])

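Note that importance() returns a matrix, so we convert it to a data frame. For a regression forest fitted with importance = TRUE, the first column is %IncMSE (the permutation-based measure) and the second is IncNodePurity, so varimp[,1] selects %IncMSE.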
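A regression forest fitted with importance = TRUE carries two importance measures, and it can help to request them by name rather than by column position. The following sketch shows both, plus the package's own plotting helper; the object names imp, inc_mse and purity are chosen for this example.

```r
library(randomForest)
data(mtcars)

set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, ntree = 500,
                         importance = TRUE, mtry = 3,
                         replace = TRUE, nodesize = 5)

imp <- importance(rf_model)
colnames(imp)                               # "%IncMSE" and "IncNodePurity"

inc_mse <- importance(rf_model, type = 1)   # permutation-based measure
purity  <- importance(rf_model, type = 2)   # node-impurity-based measure

# Predictors sorted from most to least important by %IncMSE
sort(imp[, "%IncMSE"], decreasing = TRUE)

varImpPlot(rf_model)                        # built-in plot of both measures
```

The two measures do not always rank predictors identically, so it is worth stating which one a plot or table is based on.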

The following commands order the variables by importance and plot them.

library(ggplot2)
ggplot(varimp, aes(x = reorder(variable, importance), y = importance)) + 
  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) + 
  labs(title = "Variable Importance Plot", x = "Variable", y = "Importance")

[Figure: Variable Importance Plot]

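The plot shows that wt (weight) is the most important predictor of fuel efficiency by this measure, followed by hp (horsepower) and disp (engine displacement).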

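Conclusion

We used the randomForest package to apply the random forest method to a regression problem in R. We built a random forest regression model to predict fuel efficiency and examined the variable importance measures. Random forest is a powerful machine learning method that can be applied to a wide range of regression and classification tasks.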