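How to Include Interaction in Regression Using R Programming?
In this article we look at what an interaction is and whether including one in a model gives better results. Suppose X1 and X2 are features of a dataset and Y is the class label or output we are trying to predict. If X1 and X2 interact, the effect of X1 on Y depends on the value of X2, and vice versa; that is what an interaction between features means. Knowing that a dataset may contain an interaction is only half the question: we also need to know when including it in the model actually makes the model more accurate. We will use the R language to work through this.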
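Written as a linear model, the definition above takes the following form. The coefficient symbols \beta_0 to \beta_3 and the error term \varepsilon are notation assumed here for illustration; the article itself does not name them.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon,
\qquad
\frac{\partial\,\mathbb{E}[Y]}{\partial X_1} = \beta_1 + \beta_3 X_2
```

The effect of X1 on Y is \beta_1 + \beta_3 X_2, so it changes with X2 whenever \beta_3 is nonzero. With \beta_3 = 0 the effect of X1 is the same at every value of X2, which is the no-interaction case.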
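Should We Include Interaction in Our Model?
Before including an interaction in a model, ask two questions:
- Does the interaction make sense conceptually?
- Is the interaction term statistically significant? In other words, do we believe the slopes of the regression lines differ significantly?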
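The second question can be checked directly from a fitted model. The sketch below is a minimal illustration on simulated data, not the article's dataset: the names `sim`, `x`, `group` and `y` are hypothetical, and the data are generated with a deliberately strong interaction so the test has something to find.

```r
# Simulated (hypothetical) data: the slope of y on x differs by group.
set.seed(42)
n     <- 200
x     <- runif(n, 0, 10)
group <- factor(sample(c("no", "yes"), n, replace = TRUE))
y     <- 1 + 0.5 * x + 0.3 * (group == "yes") +
         1.0 * x * (group == "yes") + rnorm(n, sd = 1)
sim   <- data.frame(y, x, group)

# `x * group` expands to x + group + x:group
fit <- lm(y ~ x * group, data = sim)

# The row named "x:groupyes" in the coefficient table holds the
# t-test for the interaction term
coefs         <- summary(fit)$coefficients
interaction_p <- coefs["x:groupyes", "Pr(>|t|)"]
interaction_p < 0.05
```

A small p-value here says the two groups' slopes differ by more than chance would explain. It does not answer the first question, which still has to come from subject knowledge.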
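Implementation in R
Let's look at interaction in a linear regression model through an example.
- Dataset
- Lung capacity dataset (LungCapData)
- Parameters/variables:
- Dependent variable (Y): LungCap
- Independent variable (X1): Smoke (yes/no)
- Independent variable (X2): Age
Example
Step 1: Load the dataset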
R
# Read in the Lung Cap Data
LungCapData <- read.table(file.choose(),
                          header = TRUE,
                          sep = "\t")
# Attach LungCapData
attach(LungCapData)
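`attach()` makes the columns visible by bare name, but it can silently mask other objects that share those names. One common alternative is to keep the data frame explicit and pass it through the `data` argument. The sketch below assumes the file is tab-separated with columns `LungCap`, `Age` and `Smoke`; the four rows are made-up stand-ins for the real file, and `lung` is a hypothetical name.

```r
# Hypothetical tab-separated sample standing in for the real file
txt <- paste("LungCap\tAge\tSmoke",
             "6.475\t6\tno",
             "10.125\t18\tyes",
             "9.550\t16\tno",
             "11.125\t14\tno",
             sep = "\n")

# stringsAsFactors = TRUE turns Smoke into a factor, which lm() needs
# in order to build the Smokeyes indicator
lung <- read.table(text = txt, header = TRUE, sep = "\t",
                   stringsAsFactors = TRUE)
str(lung)

# Columns are found through the data frame, so attach() is not needed
fit <- lm(LungCap ~ Age, data = lung)
```

With the real file, `text = txt` would be replaced by the file path or `file.choose()` as in the step above.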
Step 2: Plot the data, using different colours for smokers (red) and non-smokers (blue)
R
# Plot the data, using different
# colours for smoke(red)/non-smoke(blue)
# First, plot the data for
# the Non-Smokers, in Blue
plot(Age[Smoke == "no"],
     LungCap[Smoke == "no"],
     col = "blue",
     ylim = c(0, 15), xlim = c(0, 20),
     xlab = "Age", ylab = "LungCap",
     main = "LungCap vs. Age,Smoke")
Output: scatter plot of LungCap against Age for non-smokers, drawn as blue open circles.
R
# Now, add in the points for
# the Smokers, in Solid Red Circles
points(Age[Smoke == "yes"],
       LungCap[Smoke == "yes"],
       col = "red", pch = 16)
Output: the same plot with smokers added as solid red circles.
R
# And, add in a legend
legend(1, 15,
       legend = c("NonSmoker", "Smoker"),
       col = c("blue", "red"),
       pch = c(1, 16), bty = "n")
Output: the same plot with a legend distinguishing non-smokers from smokers.
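The three calls above build the plot one group at a time. The same picture can be drawn in one `plot()` call by looking up a colour and a plotting symbol for each row. This is a sketch on simulated data: `demo` is a hypothetical stand-in for LungCapData and its values are not real measurements.

```r
# Hypothetical stand-in for LungCapData (values are simulated)
set.seed(1)
n    <- 60
demo <- data.frame(
  Age   = sample(3:19, n, replace = TRUE),
  Smoke = factor(sample(c("no", "yes"), n, replace = TRUE,
                        prob = c(0.8, 0.2)))
)
demo$LungCap <- 1 + 0.55 * demo$Age + rnorm(n, sd = 1.5)

# One colour and one plotting symbol per row, looked up by Smoke level
cols <- c(no = "blue", yes = "red")[as.character(demo$Smoke)]
pchs <- c(no = 1, yes = 16)[as.character(demo$Smoke)]

plot(demo$Age, demo$LungCap, col = cols, pch = pchs,
     xlab = "Age", ylab = "LungCap", main = "LungCap vs. Age, Smoke")
legend("topleft", legend = c("NonSmoker", "Smoker"),
       col = c("blue", "red"), pch = c(1, 16), bty = "n")
```

Indexing a named vector by the factor's labels keeps the colour mapping in one place, which helps if more groups are added later.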
Step 3: Fit a regression model using Age, Smoke, and their interaction
R
# Fit a Reg Model, using Age,
# Smoke, and their INTERACTION
model1 <- lm(LungCap ~ Age*Smoke)
coef(model1)
Output:
 (Intercept)          Age     Smokeyes Age:Smokeyes
  1.05157244   0.55823350   0.22601390  -0.05970463
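These four numbers describe two straight lines, one per smoking group. The short calculation below shows how to read them, using only the coefficients printed above; the vector name `b` and the helper names are chosen here for illustration.

```r
# Coefficients copied from the coef(model1) output above
b <- c("(Intercept)"  =  1.05157244,
       "Age"          =  0.55823350,
       "Smokeyes"     =  0.22601390,
       "Age:Smokeyes" = -0.05970463)

# Non-smokers: Smoke == "no" is the reference level, so their line
# uses the intercept and the Age slope as they are
nonsmoker <- c(intercept = b[["(Intercept)"]],
               slope     = b[["Age"]])

# Smokers: Smokeyes shifts the intercept, Age:Smokeyes shifts the slope
smoker <- c(intercept = b[["(Intercept)"]] + b[["Smokeyes"]],
            slope     = b[["Age"]] + b[["Age:Smokeyes"]])

round(nonsmoker, 3)  # intercept 1.052, slope 0.558
round(smoker, 3)     # intercept 1.278, slope 0.499
```

The smoker slope comes out as 0.499 when computed from the unrounded coefficients. The 0.498 passed to `abline` in Step 4 is what results from rounding each coefficient to three decimals first (0.558 - 0.060).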
R
# Note, that the "*" fits a model with
# Age, Smoke and AgeXSmoke INT.
# Note, also that the same model
# can be fit using the ":"
model1 <- lm(LungCap ~ Age + Smoke + Age:Smoke)
# Ask for a summary of the model
summary(model1)
Output:
Call:
lm(formula = LungCap ~ Age + Smoke + Age:Smoke)
Residuals:
    Min      1Q  Median      3Q     Max
-4.8586 -1.0174 -0.0251  1.0004  4.1996
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.05157    0.18706   5.622  2.7e-08 ***
Age           0.55823    0.01473  37.885  < 2e-16 ***
Smokeyes      0.22601    1.00755   0.224    0.823
Age:Smokeyes -0.05970    0.06759  -0.883    0.377
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.515 on 721 degrees of freedom
Multiple R-squared:  0.6776, Adjusted R-squared:  0.6763
F-statistic: 505.1 on 3 and 721 DF,  p-value: < 2.2e-16
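The row that answers the "is the interaction significant?" question is Age:Smokeyes. Its t value and p-value can be reproduced by hand from the estimate, the standard error and the residual degrees of freedom printed above, which makes clear what the table is testing. The variable names below are chosen here for illustration.

```r
# Values copied from the Age:Smokeyes row of the summary above
estimate  <- -0.05970
std_error <-  0.06759
df_resid  <-  721       # residual degrees of freedom

# t statistic for H0: the interaction coefficient is zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)  # -0.883
round(p_value, 3)  # approximately 0.377, as in the Pr(>|t|) column
```

A p-value of about 0.377 is well above the usual 0.05 threshold, so this output gives no evidence that the slope of LungCap on Age differs between smokers and non-smokers.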
Step 4: Add the regression lines from our model using the abline command
R
# Now, let's add in the regression
# lines from our model using the
# abline command for the Non-Smokers, in Blue
abline(a = 1.052, b = 0.558,
       col = "blue", lwd = 3)
Output: the scatter plot with the blue regression line for non-smokers.
R
# And now, add in the line for Smokers, in Red
abline(a = 1.278, b = 0.498,
       col = "red", lwd = 3)
Output: the scatter plot with both regression lines, blue for non-smokers and red for smokers.
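Typing intercepts and slopes into `abline` by hand works, but the numbers go stale if the model is refitted. A possible alternative is to let `predict()` compute the fitted values for each group and draw them with `lines()`. This is a sketch on simulated data: `demo`, `fit` and `age_grid` are hypothetical names, and the values are not the real LungCap measurements.

```r
# Hypothetical stand-in for LungCapData (values are simulated)
set.seed(7)
n    <- 100
demo <- data.frame(
  Age   = sample(3:19, n, replace = TRUE),
  Smoke = factor(sample(c("no", "yes"), n, replace = TRUE))
)
demo$LungCap <- 1 + 0.55 * demo$Age - 0.5 * (demo$Smoke == "yes") +
                rnorm(n, sd = 1.5)

fit <- lm(LungCap ~ Age * Smoke, data = demo)

plot(demo$Age, demo$LungCap,
     col = ifelse(demo$Smoke == "yes", "red", "blue"),
     xlab = "Age", ylab = "LungCap")

# Predict along an Age grid for each group and draw the fitted line
age_grid <- seq(min(demo$Age), max(demo$Age), length.out = 50)
for (lev in levels(demo$Smoke)) {
  new_rows <- data.frame(Age   = age_grid,
                         Smoke = factor(lev, levels = levels(demo$Smoke)))
  lines(age_grid, predict(fit, newdata = new_rows),
        col = if (lev == "yes") "red" else "blue", lwd = 3)
}
```

Because the lines come from the model object, the same loop works unchanged for the model without the interaction term, where the two lines come out parallel.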
R
# Ask for that model summary again
summary(model1)
# Fit the model that does
# NOT include INTERACTION
model2 <- lm(LungCap ~ Age + Smoke)
summary(model2)
Output:
> summary(model1)
Call:
lm(formula = LungCap ~ Age + Smoke + Age:Smoke)
Residuals:
    Min      1Q  Median      3Q     Max
-4.8586 -1.0174 -0.0251  1.0004  4.1996
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.05157    0.18706   5.622  2.7e-08 ***
Age           0.55823    0.01473  37.885  < 2e-16 ***
Smokeyes      0.22601    1.00755   0.224    0.823
Age:Smokeyes -0.05970    0.06759  -0.883    0.377
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.515 on 721 degrees of freedom
Multiple R-squared:  0.6776, Adjusted R-squared:  0.6763
F-statistic: 505.1 on 3 and 721 DF,  p-value: < 2.2e-16
> summary(model2)
Call:
lm(formula = LungCap ~ Age + Smoke)
Residuals:
    Min      1Q  Median      3Q     Max
-4.8559 -1.0289 -0.0363  1.0083  4.1995
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.08572    0.18299   5.933 4.61e-09 ***
Age          0.55540    0.01438  38.628  < 2e-16 ***
Smokeyes    -0.64859    0.18676  -3.473 0.000546 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.514 on 722 degrees of freedom
Multiple R-squared:  0.6773, Adjusted R-squared:  0.6764
F-statistic: 757.5 on 2 and 722 DF,  p-value: < 2.2e-16
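Reading the two summaries together: the Age:Smokeyes term has a p-value of 0.377, and the adjusted R-squared barely moves (0.6763 with the interaction, 0.6764 without), so this output gives little reason to keep the interaction. One common way to make the comparison formal is a partial F-test with `anova()`. The sketch below runs it on simulated data, since the LungCap file is not reproduced here; `demo`, `with_int` and `without_int` are hypothetical names and the values are not real measurements.

```r
# Hypothetical stand-in for LungCapData (values are simulated)
set.seed(123)
n    <- 300
demo <- data.frame(
  Age   = sample(3:19, n, replace = TRUE),
  Smoke = factor(sample(c("no", "yes"), n, replace = TRUE))
)
# Simulated with parallel lines: no true Age-by-Smoke interaction
demo$LungCap <- 1 + 0.55 * demo$Age - 0.65 * (demo$Smoke == "yes") +
                rnorm(n, sd = 1.5)

with_int    <- lm(LungCap ~ Age * Smoke, data = demo)
without_int <- lm(LungCap ~ Age + Smoke, data = demo)

# Partial F-test: does adding Age:Smoke reduce the residual sum of
# squares by more than chance would explain?
cmp <- anova(without_int, with_int)
print(cmp)

# AIC is another yardstick; lower is better
AIC(without_int, with_int)
```

When the larger model adds a single term, as here, the F-test p-value matches the p-value of that term's t-test in `summary()`, so the two approaches agree. The F-test becomes more useful when the interaction involves a factor with more than two levels and therefore adds several coefficients at once.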