Regression with Categorical Variables in R Programming

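Regression is a multi-step process for estimating the relationship between a dependent variable and one or more independent variables (also called predictors or covariates). Regression analysis serves two conceptually distinct purposes. First, it is widely used for prediction and forecasting, where its use overlaps substantially with the field of machine learning. Second, it can sometimes be used to infer relationships between the independent and dependent variables.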

Regression with Categorical Variables

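A categorical variable is a variable that can take one of a limited, fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. Categorical variables are also known as factors or qualitative variables. The type of regression analysis best suited to a categorical dependent variable is logistic regression. Logistic regression estimates its parameters by maximum likelihood estimation. It models the relationship between a set of independent variables and a categorical dependent variable. Implementing regression models is much easier in R because of its excellent built-in libraries. Now let us build a logistic regression model with categorical variables to understand this better.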
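To make the maximum likelihood step concrete, the following is a minimal sketch on simulated data. The variables x and y, the coefficients used to generate them, and the helper neg_log_lik are illustrative assumptions and are not part of the article's dataset. The sketch minimises the negative Bernoulli log-likelihood directly with optim() and compares the result with what glm() reports for the same data.

```r
# Simulated data: one numeric predictor and a binary (0/1) response.
# The generating coefficients (-0.5, 1.2) are arbitrary assumptions.
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood of the logistic model for coefficients beta.
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x        # linear predictor (log-odds)
  -sum(y * eta - log1p(exp(eta)))     # Bernoulli log-likelihood, negated
}

# Direct numerical maximisation of the likelihood.
fit_optim <- optim(par = c(0, 0), fn = neg_log_lik, method = "BFGS")

# The same model fitted by glm().
fit_glm <- glm(y ~ x, family = binomial)

# The two sets of estimates should agree closely.
print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))
```

Both routes optimise the same likelihood, so the estimates should differ only by the optimiser's numerical tolerance.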

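Example: The goal is to predict whether a candidate will be admitted to a university from variables such as gre, gpa, and rank. The R script is given alongside and is commented for easier understanding. The data is in .csv format. We use the getwd() function to find the working directory and place the downloaded dataset binary.csv there before proceeding.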

R

# Prepare the dataset
getwd()
data <- read.csv("binary.csv")
str(data)

Output:

'data.frame':    400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

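Looking at the structure of the dataset, we see that it has 4 variables. admit indicates whether a candidate was admitted (1 if admitted, 0 if not). gre, gpa, and rank give the candidate's GRE score, the GPA from their previous institution, and the ranking of that institution. We use admit as the dependent variable and gre, gpa, and rank as the independent variables. Note that admit and rank are categorical variables but are stored as numeric types. To use them as categorical variables in the model, we convert them to factor variables with the as.factor() function.

R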

# converting admit and rank
# columns into factor variables
data$admit = as.factor(data$admit)
data$rank = as.factor(data$rank)
  
# two-way table of factor
# variable
xtabs(~admit + rank, data = data)

Output:

     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12
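The effect of the conversion can be seen in the design matrix that R builds for a model. The sketch below uses a small hypothetical data frame (toy and its rank values are made up for illustration) to show that a numeric rank contributes one column, while a factor rank expands into one indicator column per non-reference level. This is why a fitted model reports separate coefficients such as rank2, rank3, and rank4.

```r
# Hypothetical toy data: rank stored as a number, as in the raw CSV.
toy <- data.frame(rank = c(1, 2, 3, 4, 2, 3))

# As a numeric column, rank contributes a single slope column.
mm_numeric <- model.matrix(~ rank, data = toy)

# As a factor, rank expands to one 0/1 indicator column per
# non-reference level; level 1 is absorbed into the intercept.
toy$rank_f <- as.factor(toy$rank)
mm_factor <- model.matrix(~ rank_f, data = toy)

print(colnames(mm_numeric))  # "(Intercept)" "rank"
print(colnames(mm_factor))   # "(Intercept)" "rank_f2" "rank_f3" "rank_f4"
```

With the default treatment contrasts, each indicator coefficient is interpreted relative to the reference level (here rank 1).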

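Now split the data into a training set and a test set. The training set is used to find the relationship between the dependent and independent variables, while the test set is used to assess the model's performance. We use about 60% of the dataset as the training set. Rows are assigned to the two sets by random sampling with R's sample() function. Calling set.seed() first produces the same random sample on every run, which keeps the results reproducible.

R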

# Partition the data into training (about 60%) and test (about 40%) sets
set.seed(1234)
data1 <- sample(2, nrow(data),
                replace = TRUE,
                prob = c(0.6, 0.4))
train <- data[data1 == 1, ]
test <- data[data1 == 2, ]
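Because sample() with prob assigns each row to a group independently, the split above is only approximately 60/40 (the confusion matrix later in the article sums to 249 training rows out of 400). If an exact split is preferred, one alternative is to sample row indices without replacement. The sketch below illustrates this on a hypothetical data frame named toy; it is not the article's method, only a variant under that assumption.

```r
# Hypothetical stand-in for the real dataset: 400 rows.
set.seed(1234)
toy <- data.frame(id = 1:400, value = rnorm(400))

# Draw exactly 60% of the row indices without replacement.
n_train <- round(0.6 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train)

train_exact <- toy[train_idx, ]
test_exact <- toy[-train_idx, ]   # every row not drawn for training

print(c(train = nrow(train_exact), test = nrow(test_exact)))  # train 240, test 160
```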

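Now build a logistic regression model for the data. The glm() function fits a generalized linear model; with family = 'binomial' it fits a logistic regression. The glm() function has the following syntax.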

Syntax:

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)

Parameter	Description
formula	a symbolic description of the model to be fitted.
family	a description of the error distribution and link function to be used in the model.
data	an optional data frame.
weights	an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector.
subset	an optional vector specifying a subset of observations to be used in the fitting process.
na.action	a function which indicates what should happen when the data contain NAs.
start	starting values for the parameters in the linear predictor.
etastart	starting values for the linear predictor.
mustart	starting values for the vector of means.
offset	this can be used to specify an a priori known component to be included in the linear predictor during fitting.
control	a list of parameters for controlling the fitting process.
model	a logical value indicating whether model frame should be included as a component of the returned value.
method	the method to be used in fitting the model.
x,y	logical values indicating whether the response vector and model matrix used in the fitting process should be returned as components of the returned value.
singular.ok	logical; if FALSE a singular fit is an error.
contrasts	an optional list.
...	arguments to be used to form the default control argument if it is not supplied directly.

R

mymodel<-glm(admit~gre + gpa + rank, 
                        data = train, 
                        family = 'binomial')
summary(mymodel)

Output:

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6576  -0.8724  -0.6184   1.0683   2.1035  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -4.972329   1.518865  -3.274  0.00106 **
gre          0.001449   0.001405   1.031  0.30270   
gpa          1.233117   0.450550   2.737  0.00620 **
rank2       -0.784080   0.406376  -1.929  0.05368 . 
rank3       -1.203013   0.426614  -2.820  0.00480 **
rank4       -1.699652   0.536974  -3.165  0.00155 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 312.66  on 248  degrees of freedom
Residual deviance: 283.38  on 243  degrees of freedom
AIC: 295.38

Number of Fisher Scoring iterations: 4
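The estimates in the coefficient table above are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below copies the gpa and rank estimates and standard errors from the summary output above into plain vectors (the names est, se, odds_ratios, and wald_ci are illustrative) and adds approximate 95% Wald intervals, assuming the usual normal approximation.

```r
# Estimates and standard errors copied from the summary output above.
est <- c(gpa = 1.233117, rank2 = -0.784080, rank3 = -1.203013, rank4 = -1.699652)
se  <- c(gpa = 0.450550, rank2 = 0.406376, rank3 = 0.426614, rank4 = 0.536974)

# Odds ratio: multiplicative change in the odds of admission per unit
# increase in gpa, or relative to rank 1 for the rank indicators.
odds_ratios <- exp(est)

# Approximate 95% Wald confidence intervals on the odds-ratio scale.
wald_ci <- cbind(lower = exp(est - 1.96 * se),
                 upper = exp(est + 1.96 * se))

print(round(cbind(odds_ratio = odds_ratios, wald_ci), 3))
```

With a fitted model object available, exp(coef(mymodel)) gives the same odds ratios without copying numbers by hand. An odds ratio below 1 for rank4 means lower odds of admission than for the reference level, rank 1.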

From the model summary, gre is not statistically significant in the prediction (p ≈ 0.30), so we can drop it from the model and refit as follows:

R

mymodel<-glm(admit~gpa + rank, 
                  data = train, 
                 family = 'binomial')
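Dropping a predictor on the basis of a single p-value is a judgment call. One common way to check the decision is to compare the nested models with a likelihood-ratio test and with AIC. The sketch below does this on simulated data; the data frame sim, its generating coefficients, and the model names full and reduced are assumptions made for illustration, with gre deliberately given no effect on the response.

```r
# Simulated stand-in for the admissions data (all values are made up).
set.seed(2024)
n <- 400
sim <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 110)),
  gpa  = runif(n, min = 2.2, max = 4),
  rank = factor(sample(1:4, n, replace = TRUE))
)
# Assumed log-odds: depends on gpa and rank only, not on gre.
eta <- -4 + 1.1 * sim$gpa - 0.5 * (as.integer(sim$rank) - 1)
sim$admit <- rbinom(n, size = 1, prob = plogis(eta))

full    <- glm(admit ~ gre + gpa + rank, data = sim, family = binomial)
reduced <- glm(admit ~ gpa + rank, data = sim, family = binomial)

# Likelihood-ratio test: does adding gre reduce the deviance significantly?
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# AIC penalises the extra parameter; lower is better.
print(AIC(reduced, full))
```

A large p-value in the second row of the anova table suggests the smaller model fits about as well, which supports dropping the extra term.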

Now let us analyze the regression model by making some predictions.

R

# Prediction
p1<-predict(mymodel, train, 
            type = 'response')
head(p1)

Output:

        1         7         8        10        12        13 
0.3013327 0.3784012 0.2414806 0.5116852 0.4610888 0.7211702

R

head(train)

Output:

   admit gre  gpa rank
1      0 380 3.61    3
7      1 560 2.98    1
8      0 400 3.08    2
10     0 700 3.92    2
12     0 440 3.22    1
13     1 760 4.00    1
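The values returned with type = 'response' are predicted probabilities. The sketch below, on simulated data (sim, fit, and new_applicant are hypothetical names and values), shows how they relate to the log-odds returned by type = "link", and how to score a new observation whose factor column uses the same levels as the training data.

```r
# Simulated stand-in for the training data (all values are made up).
set.seed(7)
n <- 200
sim <- data.frame(
  gpa  = runif(n, min = 2.2, max = 4),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -4 + 1.2 * sim$gpa - 0.5 * (as.integer(sim$rank) - 1)
sim$admit <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(admit ~ gpa + rank, data = sim, family = binomial)

# type = "link" returns log-odds; type = "response" returns probabilities.
log_odds <- predict(fit, sim, type = "link")
prob     <- predict(fit, sim, type = "response")

# The inverse logit (plogis) maps log-odds to probabilities.
print(all.equal(plogis(log_odds), prob))

# Scoring a new, hypothetical applicant: the factor must carry the
# same levels as the factor used to fit the model.
new_applicant <- data.frame(
  gpa  = 3.5,
  rank = factor(2, levels = levels(sim$rank))
)
print(predict(fit, new_applicant, type = "response"))
```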

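Next, we convert the predicted probabilities into class labels using a 0.5 cutoff and build a confusion matrix to compare the numbers of true and false positives and negatives. We build the confusion matrix on the training data.

R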

# Confusion matrix (training data), used below
# to compute the misclassification error
pre1<-ifelse(p1 > 0.5, 1, 0)
table<-table(Prediction = pre1, 
             Actual = train$admit) 
table

Output:

          Actual
Prediction   0   1
         0 158  55
         1  11  25

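The model produces 158 true negatives (0) and 25 true positives (1), along with 55 false negatives (predicted 0, actually 1) and 11 false positives (predicted 1, actually 0). Now let us compute the misclassification error for the training data, which is 1 minus the classification accuracy.

R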

1 - sum(diag(table)) / sum(table)

Output:

[1] 0.2650602
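The overall error rate hides how the errors are distributed between the two classes. The sketch below rebuilds the confusion matrix from the counts printed above (rows are predictions, columns are actual values) and derives sensitivity and specificity alongside the error rate. The names cm, tn, fp, fn, and tp are illustrative.

```r
# Confusion matrix counts copied from the output above.
# matrix() fills column by column: Actual = 0 first, then Actual = 1.
cm <- matrix(c(158, 11, 55, 25), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Actual = c("0", "1")))

tn <- cm["0", "0"]   # predicted 0, actually 0
fp <- cm["1", "0"]   # predicted 1, actually 0
fn <- cm["0", "1"]   # predicted 0, actually 1
tp <- cm["1", "1"]   # predicted 1, actually 1

accuracy          <- (tp + tn) / sum(cm)
misclassification <- 1 - accuracy
sensitivity       <- tp / (tp + fn)   # share of actual admits found
specificity       <- tn / (tn + fp)   # share of actual rejects found

print(round(c(accuracy = accuracy,
              misclassification = misclassification,
              sensitivity = sensitivity,
              specificity = specificity), 4))
# accuracy 0.7349, misclassification 0.2651,
# sensitivity 0.3125, specificity 0.9349
```

At the 0.5 cutoff the model identifies most rejections correctly but finds fewer than a third of the actual admissions.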

The misclassification error rate is 26.5%. In the same way, regression with categorical variables can be applied to a variety of other data.
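The error above is measured on the training data, which tends to be optimistic. The held-out test set exists to give a fairer estimate. The sketch below shows the same steps applied to a test set, using simulated data so that it runs on its own (sim, part, train_sim, test_sim, and the generating coefficients are all assumptions). Converting the predictions to a factor with fixed levels keeps the confusion matrix 2 x 2 even if one class is never predicted.

```r
# Simulated stand-in for the admissions data (all values are made up).
set.seed(99)
n <- 400
sim <- data.frame(
  gpa  = runif(n, min = 2.2, max = 4),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -4 + 1.2 * sim$gpa - 0.5 * (as.integer(sim$rank) - 1)
sim$admit <- factor(rbinom(n, size = 1, prob = plogis(eta)))

# Same style of random split as in the article.
part <- sample(2, n, replace = TRUE, prob = c(0.6, 0.4))
train_sim <- sim[part == 1, ]
test_sim  <- sim[part == 2, ]

# Fit on the training rows only, then predict on the unseen test rows.
fit <- glm(admit ~ gpa + rank, data = train_sim, family = binomial)
p_test <- predict(fit, test_sim, type = "response")

# Fixed levels guarantee a 2 x 2 table, so diag() stays meaningful.
pred_test <- factor(ifelse(p_test > 0.5, 1, 0), levels = c(0, 1))
cm_test <- table(Prediction = pred_test, Actual = test_sim$admit)
print(cm_test)

test_error <- 1 - sum(diag(cm_test)) / sum(cm_test)
print(test_error)
```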

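Regression analysis is a very effective approach, and many types of regression model are available. The choice usually depends on the type of data in the dependent variable and on which model gives the best fit; for example, logistic regression is best suited to a categorical dependent variable.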