Logistic Regression using Statsmodels

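Prerequisite: Understanding Logistic Regression
Logistic regression is a type of regression analysis used to estimate the probability that a particular event occurs. It is the most suitable type of regression when the dependent variable is categorical and can take only discrete values.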

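Dataset:
In this article, we predict whether a student will be admitted to a particular college, based on their GMAT score, GPA, and work experience. The dependent variable here is binary: it takes exactly one of two values, admitted or not admitted.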

Building the Logistic Regression model:

Statsmodels is a Python module that provides various functions for estimating different statistical models and performing statistical tests.

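First, we define the set of dependent (y) and independent (X) variables. If the dependent variable is in non-numeric form, it is first converted to numeric form using dummy variables. The example trains the model on the file logit_train1.csv.
Statsmodels provides a Logit() class for performing logistic regression. Logit() takes y and X as arguments and returns a Logit object; calling fit() on that object then fits the model to the data. Note that Logit() does not add an intercept term automatically, so the model below contains only the three listed predictors.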

Python3

# importing libraries
import statsmodels.api as sm
import pandas as pd
 
# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col = 0)
 
# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
  
# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()

Output:

Optimization terminated successfully.
         Current function value: 0.352707
         Iterations 8

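In the output, "Iterations" refers to the number of iterations the optimizer performed while maximizing the likelihood. By default, at most 35 iterations are performed; if the optimizer has not converged by then, fitting stops with a convergence warning.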

Summary table:

The summary table below gives a descriptive summary of the regression results.

Python3

# printing the summary table
print(log_reg.summary())

Output:

Logit Regression Results                           
==============================================================================
Dep. Variable:               admitted   No. Observations:                   30
Model:                          Logit   Df Residuals:                       27
Method:                           MLE   Df Model:                            2
Date:                Wed, 15 Jul 2020   Pseudo R-squ.:                  0.4912
Time:                        16:09:17   Log-Likelihood:                -10.581
converged:                       True   LL-Null:                       -20.794
Covariance Type:            nonrobust   LLR p-value:                 3.668e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat               -0.0262      0.011     -2.383      0.017      -0.048      -0.005
gpa                 3.9422      1.964      2.007      0.045       0.092       7.792
work_experience     1.1983      0.482      2.487      0.013       0.254       2.143
===================================================================================

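Explanation of some of the terms in the summary table:

coef: the coefficients of the independent variables in the regression equation.
Log-Likelihood: the natural logarithm of the likelihood function, evaluated at the fitted parameters. Maximum Likelihood Estimation (MLE) is the optimization process of finding the set of parameters that results in the best fit.
LL-Null: the log-likelihood of the model when no independent variable is included (only an intercept is included).
Pseudo R-squ.: a substitute for the R-squared value of least-squares linear regression. It is computed as one minus the ratio of the log-likelihood of the full model to the log-likelihood of the null model (McFadden's pseudo R-squared).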
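
The relationship between Log-Likelihood, LL-Null, and Pseudo R-squ. can be checked by hand. The short sketch below recomputes the pseudo R-squared from the two log-likelihood values printed in the summary table; the variable names are chosen here for illustration.

```python
# Values copied from the summary table above.
log_likelihood = -10.581  # Log-Likelihood of the fitted model
ll_null = -20.794         # LL-Null, the intercept-only model

# McFadden's pseudo R-squared: one minus the ratio of the two log-likelihoods.
pseudo_r2 = 1.0 - log_likelihood / ll_null

print(round(pseudo_r2, 4))  # prints 0.4912
```

The result agrees with the Pseudo R-squ. value of 0.4912 reported in the table.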

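Predicting on new data:

Now we test the model on new data. The test data is loaded from the file logit_test1.csv.
The predict() method is used to perform the predictions. The predictions obtained are fractional values (between 0 and 1) that denote the probability of being admitted. These values are therefore rounded to obtain the discrete values 1 or 0.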

Python3

# loading the testing dataset 
df = pd.read_csv('logit_test1.csv', index_col = 0)
 
# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']
 
# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))
 
# comparing original and predicted values of y
print('Actual values', list(ytest.values))
print('Predictions :', prediction)

Output:

Optimization terminated successfully.
         Current function value: 0.352707
         Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
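
To make the step from coefficients to a 0/1 prediction concrete, the sketch below reproduces by hand what predict() followed by rounding does for a single observation. It uses the coefficients from the summary table and a hypothetical applicant invented for illustration; since the model was fitted without an intercept, the linear predictor is just the weighted sum of the three inputs.

```python
import math

# Coefficients copied from the summary table (the model has no intercept).
coef = {"gmat": -0.0262, "gpa": 3.9422, "work_experience": 1.1983}

# Hypothetical applicant, invented for illustration (not from the test file).
applicant = {"gmat": 600, "gpa": 3.5, "work_experience": 3}

# Linear predictor: weighted sum of the inputs.
linear = sum(coef[name] * applicant[name] for name in coef)

# The logistic (sigmoid) function maps the linear predictor to a probability.
probability = 1.0 / (1.0 + math.exp(-linear))

# Explicit 0.5 threshold instead of round().
label = int(probability >= 0.5)

print("linear predictor:", linear)
print("probability:", probability)
print("predicted label:", label)
```

An explicit threshold is used here because Python's built-in round() rounds exact ties to the nearest even integer, so round(0.5) returns 0; comparing against 0.5 makes the cut-off rule visible and easy to change.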

Testing the accuracy of the model:

Python3

from sklearn.metrics import (confusion_matrix,
                           accuracy_score)
 
# confusion matrix
cm = confusion_matrix(ytest, prediction)
print("Confusion Matrix : \n", cm)
 
# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

Output:

Confusion Matrix : 
 [[6 0]
 [2 2]]
Test accuracy =  0.8
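
The confusion matrix and accuracy above can be reproduced without scikit-learn. The sketch below counts the four outcomes directly from the actual and predicted values printed earlier; it also derives precision and recall for the admitted class, which are additions made here for illustration and are not computed in the original example.

```python
# Actual and predicted labels copied from the prediction output above.
actual = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives

# Same layout as the printed matrix: rows are actual, columns are predicted.
matrix = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)  # share of predicted admissions that were correct
recall = tp / (tp + fn)     # share of actual admissions that were found

print(matrix)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

With only ten test observations, the accuracy of 0.8 hides the fact that the model found just two of the four admitted students, which is why recall is worth reading alongside it.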