How to Include Factors in Regression Using R Programming?

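Categorical variables (also called factors or qualitative variables) classify observations into groups. In statistical modeling, a categorical variable stored as strings or numbers is called a factor variable. Storing a plain string variable as a factor can save memory, because R keeps each value as an integer code that points into a short table of labels. The values of a factor can also be given descriptive labels. A factor takes a limited number of distinct values, called levels. For example, an individual's gender can be recorded as a categorical variable with two levels: male or female. Regression needs numeric predictors, so when a researcher wants to include a categorical variable in a regression model, it has to be encoded in a way that keeps the results interpretable. Let's walk through all of this with code examples in R.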

Implementation in R

Storing strings or numbers as factors

First, let's create a sample dataset.

R

# creating sample
samp <- sample(0:1, 20, replace = TRUE)
samp

Output:

[1] 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1
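Because sample() draws at random, the vector above will differ from run to run. A minimal sketch of making the draw repeatable with set.seed() follows; the seed value 42 and the names samp_a and samp_b are arbitrary choices for illustration and are not part of the original example.

```r
# Fix the random seed so that sample() returns the same draw on every run.
# The seed value 42 is an arbitrary, illustrative choice.
set.seed(42)
samp_a <- sample(0:1, 20, replace = TRUE)

# Resetting the same seed reproduces the same draw
set.seed(42)
samp_b <- sample(0:1, 20, replace = TRUE)

# TRUE: both vectors hold exactly the same values
identical(samp_a, samp_b)
```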

Convert the numbers to a factor.

R

# convert the sample created above to a factor
samp1 <- factor(samp)

# check whether samp1 is a factor
is.factor(samp1)

Output:

[1] TRUE

Now do the same with strings.

R

# create a sample of strings and print it
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
str1

# convert the strings to a factor
f.str1 <- factor(str1)

# check whether f.str1 is a factor
is.factor(f.str1)

# check whether str1 is a factor
is.factor(str1)

Output:

[1] "small"  "big"    "medium" "small"  "small"  "big"    "medium" "medium" "big"   
[1] TRUE
[1] FALSE
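To see what a factor actually stores, it can help to look at its levels and the integer codes behind them. The sketch below repeats the str1 data so that it runs on its own; it relies only on base R functions.

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.str1 <- factor(str1)

# the distinct values, sorted alphabetically by default
levels(f.str1)

# how many levels there are
nlevels(f.str1)

# the integer code stored for each observation
# (its position in levels(f.str1))
as.integer(f.str1)

# number of observations in each level
table(f.str1)
```

Note that the default alphabetical order (big, medium, small) is not the natural size order, which is why the levels are set explicitly when an ordered factor is needed.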

Factors with labels

R

# relabel the levels of samp1: 0 becomes "sweet", 1 becomes "bitter"
lab <- factor(samp1, labels = c("sweet", "bitter"))
lab

Output:

 [1] bitter bitter sweet  bitter bitter bitter sweet  sweet  bitter sweet 
[11] sweet  bitter bitter bitter bitter bitter bitter bitter sweet  bitter
Levels: sweet bitter
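Labels are matched to levels by position, so it is safer to state the levels explicitly rather than rely on the default sort order. The following is a small sketch of that idea; the vector codes and the name taste are hypothetical stand-ins for the sampled data above.

```r
# hypothetical 0/1 codes standing in for the sampled values
codes <- c(1, 1, 0, 1, 0)

# levels lists the raw values; labels renames them position by position,
# so 0 becomes "sweet" and 1 becomes "bitter"
taste <- factor(codes, levels = c(0, 1), labels = c("sweet", "bitter"))
taste

# cross-tabulate raw codes against labels to confirm the mapping
table(codes, taste)
```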

Ordered factors

R

str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")

# order the factor according to the given levels
# (named sizes to avoid masking the base function order())
sizes <- ordered(str1,
                 levels = c("small", "medium", "big"))
sizes
f.order <- factor(sizes)

Output:

[1] small  big    medium small  small  big    medium medium big   
Levels: small < medium < big

Another way to create an ordered factor is:

R

# another way to create the ordered factor
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)
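One practical consequence of ordering is that the levels can be compared with relational operators. A brief sketch, repeating the data so that it runs on its own:

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

# confirm that the factor is ordered
is.ordered(f.order)

# which observations rank below "big"?
below_big <- f.order < "big"
below_big

# smallest and largest level present in the data
min(f.order)
max(f.order)
```

These comparisons are not defined for unordered factors, where R returns NA with a warning.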

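Finding the mean

R

# mean() is not defined for a factor: this returns NA with a warning
mean(samp1)

# convert the level labels back to numbers first, then take the mean
mean(as.numeric(levels(samp1)[samp1]))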

Output:

[1] NA
[1] 0.7
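The conversion goes through levels() because calling as.numeric() directly on a factor returns the internal integer codes rather than the original values. The sketch below illustrates the difference on a short, hypothetical 0/1 sample rather than the random one used above.

```r
# hypothetical short 0/1 sample stored as a factor
samp1 <- factor(c(1, 1, 0, 1, 0))

# pitfall: this returns the internal codes (1 for "0", 2 for "1")
codes <- as.numeric(samp1)
codes

# two equivalent ways to recover the original numbers
values_a <- as.numeric(as.character(samp1))
values_b <- as.numeric(levels(samp1))[samp1]
values_a

# the mean of the recovered values
mean(values_a)
```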

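Dropping levels

Subsetting a factor removes the matching observations, but the unused level remains in the list of levels.

R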

f.new <- f.order[f.order != "small"]
f.new

Output:

[1] big    medium big    medium medium big   
Levels: small < medium < big
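As the output shows, "small" is still listed among the levels even though no observation uses it. One way to remove it is the base R function droplevels(); the name f.drop below is introduced only for this sketch.

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

# subsetting keeps the unused level "small"
f.new <- f.order[f.order != "small"]
levels(f.new)

# droplevels() removes levels that no observation uses
f.drop <- droplevels(f.new)
levels(f.drop)
```

This matters for regression, since modeling functions build their dummy variables from the level set of the factor.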

Regression implementation

Consider an experiment that records how long students stay at school during fests.

R

# consider a data frame of students
student <- data.frame(
    # id of students
    id = 1:5,
    
    # name of students  
    name = c("Payal", "Dan", "Misty", "Ryan", "Gargi"),
    
    # gender of students
    gender = c("F", "M", "F", "M", "F"),
    
    # gender represented in numbers F-1,M-0
    gender_num = c(1, 0, 1, 0, 1),
    
    # the hours students stay at fests
    hours = c(2.5, 4, 5.3, 3, 2.2)
)
student

Output:

  id  name gender gender_num hours
1  1 Payal      F          1   2.5
2  2   Dan      M          0   4.0
3  3 Misty      F          1   5.3
4  4  Ryan      M          0   3.0
5  5 Gargi      F          1   2.2

The regression equation is

y = b₀ + b₁ * x

where

y: the outcome variable, predicted from the predictor variable x,

b₀ and b₁: the beta coefficients, representing the intercept and the slope, respectively.


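Here x is a dummy variable that is 1 if the student is male and 0 if the student is female, so the model predicts b₀ + b₁ for a male student and b₀ for a female student. This is the coding R chooses for the gender column, because F comes first alphabetically and becomes the reference level; it is the reverse of the gender_num column. The coefficients can be interpreted as follows:

b₀ is the average time female students spend at fests,
b₀ + b₁ is the average time male students spend at fests,
b₁ is the difference in average hours between male and female students.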
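With a single two-level predictor, this interpretation can be checked by hand from the group means. The sketch below repeats the hours and gender values from the student data; the names means, b0 and b1 are introduced only for illustration.

```r
hours  <- c(2.5, 4, 5.3, 3, 2.2)
gender <- factor(c("F", "M", "F", "M", "F"))

# average hours within each gender
means <- tapply(hours, gender, mean)
means

# intercept: mean of the reference level F
b0 <- unname(means["F"])

# slope: mean of M minus mean of F
b1 <- unname(means["M"] - means["F"])

c(b0 = b0, b1 = b1)
```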

R creates the dummy variable automatically when a categorical predictor is passed to lm():

R

# making the regression model
model <- lm(hours ~ gender, data = student) 
summary(model)$coef

Output:

             Estimate Std. Error   t value   Pr(>|t|)
(Intercept) 3.3333333  0.8397531 3.9694209 0.02857616
genderM     0.1666667  1.3277662 0.1255241 0.90804814

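The estimate for (Intercept), 3.3333333, is the average number of hours for F students. The estimate for genderM, 0.1666667, is not the average for M students but the difference between the two groups, so the average for M students is 3.3333333 + 0.1666667 = 3.5 hours. The p-value for genderM is about 0.908, far above the usual 0.05 threshold, so there is no real evidence that M students stay longer than F students.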
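To see the dummy variable that lm() builds, and to change which group serves as the baseline, the base R functions model.matrix() and relevel() can be used. The following is a sketch on a reduced copy of the student data; the names mm and model_m are introduced only for illustration.

```r
student <- data.frame(
    gender = factor(c("F", "M", "F", "M", "F")),
    hours  = c(2.5, 4, 5.3, 3, 2.2)
)

# the design matrix: an intercept column plus a 0/1 column named genderM
mm <- model.matrix(hours ~ gender, data = student)
mm

# make M the reference level instead of F
student$gender <- relevel(student$gender, ref = "M")

# refit: the intercept is now the M average and the slope is named genderF
model_m <- lm(hours ~ gender, data = student)
coef(model_m)
```

Changing the reference level flips the sign of the slope but does not change the fitted group means or the p-value of the comparison.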