How to Include Factors in Regression Using R Programming?
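Categorical variables (also called factors or qualitative variables) classify observations into groups. Their values are strings or numbers, and in statistical modeling they are called factor variables. Storing an ordinary string variable as a factor can save memory, because a factor keeps each distinct value once as a level and stores the observations as integer codes. A factor can also carry labels for its levels. It takes a limited number of distinct values, called levels. For example, an individual's gender is a categorical variable that can take two levels: male or female. Regression requires numeric variables, so when a researcher wants to include a categorical variable in a regression model, extra steps are needed to make the results interpretable. Let's look at all of this with code examples in R.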
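As a minimal sketch of the idea of levels and integer codes, the snippet below builds a factor from a small character vector. The vector `gender` and its values are made up for illustration and are not part of the article's data.

```r
# Hypothetical example data: gender recorded as plain strings
gender <- c("male", "female", "female", "male")

# Convert to a factor; by default the levels are the sorted distinct values
gender_f <- factor(gender)

levels(gender_f)      # the distinct categories
nlevels(gender_f)     # how many categories there are
as.integer(gender_f)  # the integer codes stored for each observation
```

Each observation is stored as the position of its value in `levels(gender_f)`, which is why a factor has a fixed, finite set of possible values.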
Implementation in R
Storing strings or numbers as factors
First, let's create a sample dataset.
R
# creating sample
samp <- sample(0:1, 20, replace = TRUE)
samp
Output:
[1] 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1
Convert the numbers to a factor.
R
# samp is the 0/1 sample created above
# convert the sample to a factor
samp1 <- factor(samp)
# check whether it is a factor with is.factor()
is.factor(samp1)
Output:
[1] TRUE
Now do the same for strings.
R
# create a character vector
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
# print the character vector
str1
# convert it to a factor
f.str1 <- factor(str1)
# check whether f.str1 is a factor
is.factor(f.str1)
# check whether str1 is a factor
is.factor(str1)
Output:
[1] "small"  "big"    "medium" "small"  "small"  "big"    "medium" "medium" "big"
[1] TRUE
[1] FALSE
Factors with labels
R
# relabel the levels of samp1: 0 becomes "sweet", 1 becomes "bitter"
lab <- factor(samp1, labels = c("sweet", "bitter"))
lab
Output:
 [1] bitter bitter sweet  bitter bitter bitter sweet  sweet  bitter sweet
[11] sweet  bitter bitter bitter bitter bitter bitter bitter sweet  bitter
Levels: sweet bitter
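Labels are matched to levels by position, so it can help to state the levels explicitly. The following is a small sketch with a fixed, made-up vector `codes` (an assumption, used so the result does not depend on a random sample).

```r
# Hypothetical 0/1 codes, fixed here so the result is reproducible
codes <- c(1, 0, 0, 1, 1)

# levels lists the values to expect; labels gives their display names
# in the same order: 0 -> "sweet", 1 -> "bitter"
taste <- factor(codes, levels = c(0, 1), labels = c("sweet", "bitter"))

taste
table(taste)  # number of observations per label
```

Writing `levels = c(0, 1)` next to the labels makes the mapping visible in the code instead of relying on the default sort order of the values.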
Ordered factors
R
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
# order the factor according to the given levels
order <- ordered(str1,
                 levels = c("small", "medium", "big"))
order
# factor() keeps the ordering of an ordered factor
f.order <- factor(order)
Output:
[1] small big medium small small big medium medium big
Levels: small < medium < big
Another way to create an ordered factor is:
R
# another way to order
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)
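One practical consequence of ordering is that comparisons between levels become meaningful. The sketch below uses a short, made-up vector named `sizes` (an assumption, not the article's `str1`).

```r
# Hypothetical ordered factor with the ranking small < medium < big
sizes <- factor(c("small", "big", "medium", "small"),
                levels = c("small", "medium", "big"),
                ordered = TRUE)

# Comparison operators follow the level order, not alphabetical order
below_big <- sizes < "big"
below_big

# Summaries that need an order also work on ordered factors
largest <- max(sizes)
largest
sort(sizes)
```

On an unordered factor, `<` and `max()` are not meaningful, so the ordering has to be declared when the factor is created.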
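Finding the mean
R
# mean() of a factor returns NA (with a warning) because a factor is not numeric
mean(samp1)
# convert the level labels back to numbers before averaging
mean(as.numeric(levels(samp1)[samp1]))
Output:
[1] NA
[1] 0.7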
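The conversion step matters because calling `as.numeric()` directly on a factor returns its internal codes, not the original values. Here is a short sketch with a made-up factor `f` whose levels are numbers stored as text.

```r
# Hypothetical factor whose levels are numbers stored as text
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # internal codes, not the values
values <- as.numeric(as.character(f))  # the original numbers
same   <- as.numeric(levels(f))[f]     # same numbers, converting each level once

codes
values
mean(values)
```

Both `as.numeric(as.character(f))` and `as.numeric(levels(f))[f]` recover the values; the second converts only the distinct levels and then indexes them by the factor.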
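Removing levels
R
# keep only the observations whose level is not "small";
# the unused level "small" still appears in the list of levels
f.new <- f.order[f.order != "small"]
f.new
Output: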
[1] big medium big medium medium big
Levels: small < medium < big
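As the output shows, subsetting removes the observations but keeps "small" in the list of levels. One way to discard unused levels is base R's `droplevels()`. The sketch below uses a made-up vector `sizes` for illustration.

```r
# Hypothetical ordered factor
sizes <- factor(c("small", "big", "medium", "small", "big"),
                levels = c("small", "medium", "big"),
                ordered = TRUE)

# Subsetting drops the observations but not the level
kept <- sizes[sizes != "small"]
levels(kept)

# droplevels() discards levels that no longer have any observations
kept <- droplevels(kept)
levels(kept)
```

Dropping unused levels matters in modeling and tabulation, where an empty level would otherwise still show up as a category with zero observations.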
Regression implementation
Consider an experiment that records how many hours students stay at school fests.
R
# consider a data frame of students
student <- data.frame(
  # id of students
  id = 1:5,
  # name of students
  name = c("Payal", "Dan", "Misty", "Ryan", "Gargi"),
  # gender of students
  gender = c("F", "M", "F", "M", "F"),
  # gender represented in numbers: F = 1, M = 0
  gender_num = c(1, 0, 1, 0, 1),
  # the hours students stay at fests
  hours = c(2.5, 4, 5.3, 3, 2.2)
)
student
Output:
  id  name gender gender_num hours
1  1 Payal      F          1   2.5
2  2   Dan      M          0   4.0
3  3 Misty      F          1   5.3
4  4  Ryan      M          0   3.0
5  5 Gargi      F          1   2.2
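The regression equation is
y = b0 + b1 * x
where
y: output variable predicted on the basis of a predictor variable (x),
b0 and b1: beta coefficients, representing the intercept and the slope, respectively.
For a categorical predictor, x is a dummy variable that is 1 if the student is male and 0 if the student is female, so the model predicts b0 + b1 for a male student and b0 for a female student. The coefficients can be interpreted as follows:
- b0 is the mean number of hours female students stay at fests,
- b0 + b1 is the mean number of hours male students stay at fests,
- b1 is the difference in mean hours between male and female students.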
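To make the dummy coding concrete, the sketch below prints the design matrix R would build for a gender factor. It uses `model.matrix()` from the stats package on a standalone vector that repeats the gender values from the student data.

```r
# Gender values from the student data, stored as a factor
gender <- factor(c("F", "M", "F", "M", "F"))

# "F" sorts first, so it becomes the reference level
levels(gender)

# The design matrix has an intercept column and one dummy column, genderM,
# which is 1 for male students and 0 for female students
dummy <- model.matrix(~ gender)
dummy
```

Because "F" is the reference level, the intercept b0 describes female students and the coefficient on `genderM` is b1.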
R creates the dummy variable automatically when the model is fitted with the following code:
R
# making the regression model
model <- lm(hours ~ gender, data = student)
summary(model)$coef
Output:
             Estimate Std. Error   t value   Pr(>|t|)
(Intercept) 3.3333333  0.8397531 3.9694209 0.02857616
genderM     0.1666667  1.3277662 0.1255241 0.90804814
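The intercept, 3.3333333, is the estimated mean number of hours for F students. The genderM coefficient, 0.1666667, is the estimated difference between M and F students, so the estimated mean for M students is 3.3333333 + 0.1666667 = 3.5 hours. The p-value for genderM is about 0.908, far above the usual 0.05 threshold, so the difference is not statistically significant: with these five observations there is no real evidence that M students stay longer than F students.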
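The interpretation of the coefficients can be checked against the group means directly. The sketch below rebuilds only the two columns the model needs and then uses `relevel()` to show how the choice of reference level changes the output. The object names `group_means`, `fit`, `b`, and `b_flipped` are illustrative.

```r
# The two columns of the student data used by the model
student <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  hours  = c(2.5, 4, 5.3, 3, 2.2)
)

# Mean hours per gender, computed directly
group_means <- tapply(student$hours, student$gender, mean)
group_means

# Coefficients from the same model as in the article
fit <- lm(hours ~ gender, data = student)
b <- coef(fit)

b[["(Intercept)"]]                    # equals the mean for F
b[["(Intercept)"]] + b[["genderM"]]   # equals the mean for M

# Making "M" the reference level flips the sign of the difference
student$gender <- relevel(factor(student$gender), ref = "M")
b_flipped <- coef(lm(hours ~ gender, data = student))
b_flipped
```

With "M" as the reference, the intercept becomes the mean for male students and the coefficient `genderF` is the same difference with the opposite sign. The fitted values and the p-value of the difference do not change.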