Placement Prediction Using Logistic Regression

Prerequisites: Understanding Logistic Regression, Logistic Regression using Python

In this article, we discuss how to use the logistic regression algorithm to predict a student's placement status from various student attributes.

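Placements matter a great deal to both students and educational institutions. A placement helps a student lay a strong foundation for a future career, and a good placement record gives a college or university a competitive edge in the education market.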

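This study focuses on a system that predicts whether a student will be placed, based on the student's qualifications, historical data, and experience. The predictor uses a machine learning algorithm to produce its result.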

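The algorithm used is logistic regression, a supervised classification algorithm. In a classification problem, the target variable (or output) y can take only discrete values for a given set of features (or inputs) X. The dataset contains each student's secondary school percentage, higher secondary percentage, degree percentage, degree, and work experience, among other attributes. After the results are predicted, the model's accuracy is also computed from the dataset. The dataset used here is in .csv format.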
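
To make the idea of "discrete output from continuous inputs" concrete, here is a minimal sketch of how logistic regression turns a weighted sum of features into a class label. The weights, bias, and feature values are made up for illustration and are not taken from the placement dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one feature vector (illustration only)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([1.5, 2.0])

p = sigmoid(w @ x + b)   # estimated probability of class 1
label = int(p >= 0.5)    # threshold at 0.5 to obtain a discrete class
print(p, label)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities agree with the observed labels as closely as possible.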

Below is the step-by-step approach:

Step 1: Import the required modules.

Python

# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Read the dataset that we will use for the analysis, then inspect it.

Python

# reading the file
dataset = pd.read_csv('Placement_Data_Full_Class.csv')
dataset
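
Beyond displaying the whole frame, a few quick checks help confirm that a file loaded as expected. The sketch below uses a tiny made-up CSV held in memory instead of the real file; the column `ssc_p` and all values are assumptions for illustration.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real file (values are invented)
csv_text = """sl_no,gender,ssc_p,status,salary
1,M,67.0,Placed,270000
2,F,58.5,Not Placed,
3,M,79.3,Placed,200000
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.head())        # first few rows
print(sample.dtypes)        # column types inferred by pandas
print(sample.isna().sum())  # missing values per column
```

The same four calls can be applied to `dataset` after reading the actual file.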

Step 3: Drop the columns that are not needed.

Python

# dropping the serial number and salary columns
dataset = dataset.drop('sl_no', axis=1)
dataset = dataset.drop('salary', axis=1)
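
The two calls above can also be written as a single `drop` with a list of column names. The sketch below shows this on a small invented frame; only the column names `sl_no`, `gender`, `status`, and `salary` come from the article's code.

```python
import pandas as pd

# Invented rows; only the column names mirror the article's dataset
toy = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "gender": ["M", "F", "M"],
    "status": ["Placed", "Not Placed", "Placed"],
    "salary": [270000.0, None, 200000.0],
})

# drop() returns a new DataFrame; the original frame is left unchanged
trimmed = toy.drop(columns=["sl_no", "salary"])
print(list(trimmed.columns))
```

A serial number carries no information about the student. One plausible reason for dropping the salary as well is that it is only known once a student has been placed, so keeping it would leak the label into the features.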

Step 4: Before going further, we need to preprocess and transform the data. To do this, we call the astype() method on some columns and change their data type to category.

Python

# converting columns to the category dtype for later label encoding
dataset["gender"] = dataset["gender"].astype('category')
dataset["ssc_b"] = dataset["ssc_b"].astype('category')
dataset["hsc_b"] = dataset["hsc_b"].astype('category')
dataset["degree_t"] = dataset["degree_t"].astype('category')
dataset["workex"] = dataset["workex"].astype('category')
dataset["specialisation"] = dataset["specialisation"].astype('category')
dataset["status"] = dataset["status"].astype('category')
dataset["hsc_s"] = dataset["hsc_s"].astype('category')
dataset.dtypes

Step 5: Replace the text values in these columns with their numeric category codes.

Python

# labelling the columns
dataset["gender"] = dataset["gender"].cat.codes
dataset["ssc_b"] = dataset["ssc_b"].cat.codes
dataset["hsc_b"] = dataset["hsc_b"].cat.codes
dataset["degree_t"] = dataset["degree_t"].cat.codes
dataset["workex"] = dataset["workex"].cat.codes
dataset["specialisation"] = dataset["specialisation"].cat.codes
dataset["status"] = dataset["status"].cat.codes
dataset["hsc_s"] = dataset["hsc_s"].cat.codes
 
# display dataset
dataset
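
To see what the category conversion and `.cat.codes` do together, here is a small sketch on a made-up series. The values are invented; the name `workex` is borrowed from the article's code.

```python
import pandas as pd

s = pd.Series(["Yes", "No", "No", "Yes"], name="workex")

cat = s.astype("category")
print(list(cat.cat.categories))   # ['No', 'Yes'] (sorted category names)
print(cat.cat.codes.tolist())     # [1, 0, 0, 1]

# Keep the code -> label mapping so predictions can be decoded later
mapping = dict(enumerate(cat.cat.categories))
print(mapping)                    # {0: 'No', 1: 'Yes'}
```

The codes follow the sorted order of the category names, so for columns with more than two categories the numeric order carries no real meaning. Saving the mapping before overwriting a column makes it possible to interpret the encoded values afterwards.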

Step 6: Use the iloc indexer to split the dataset into features and labels:

Python

# selecting the features and labels
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values
 
# display dependent variables
Y
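
The slicing pattern `[:, :-1]` selects every column except the last, and `[:, -1]` selects only the last one. A minimal sketch on an invented three-column frame, assuming the label sits in the final column as it does in the article's code:

```python
import pandas as pd

# Invented, already-encoded values; the label column comes last
toy = pd.DataFrame({
    "gender": [1, 0, 1],
    "workex": [0, 1, 1],
    "status": [1, 0, 1],
})

X_toy = toy.iloc[:, :-1].values   # all rows, every column but the last
y_toy = toy.iloc[:, -1].values    # all rows, last column only

print(X_toy.shape, y_toy.shape)   # (3, 2) (3,)
```

Because the split is positional, it silently picks the wrong label if the column order changes, so it is worth checking that the label really is the last column before using it.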

Step 7: Split the data into training and test sets; the test set is used later to check the model's accuracy.

Python

# dividing the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.2)
 
# display dataset
dataset.head()
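
Without a fixed seed, `train_test_split` shuffles differently on every run, so the reported accuracy will vary. The sketch below, on synthetic arrays, shows two optional parameters that the article's code does not use: `random_state` for reproducibility and `stratify` to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, 5 negatives and 15 positives
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 5 + [1] * 15)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,       # 20% of the rows go to the test set
    random_state=42,     # fixed seed, so the split is repeatable
    stratify=y_demo,     # preserve the 1:3 class ratio in both parts
)

print(X_tr.shape, X_te.shape)   # (16, 2) (4, 2)
print(np.bincount(y_te))        # class counts in the test set
```

Whether stratification is needed depends on how imbalanced the labels are; for a small dataset it tends to make the test accuracy less sensitive to an unlucky split.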

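Step 8: Now we train the model. We import LogisticRegression from the sklearn module, create a classifier, fit it on the training data, and then check the model's accuracy on the test data.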

Python

# creating a classifier using sklearn
from sklearn.linear_model import LogisticRegression
 
clf = LogisticRegression(random_state=0, solver='lbfgs',
                         max_iter=1000).fit(X_train,
                                            Y_train)
# accuracy on the test set
clf.score(X_test, Y_test)
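
For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below checks this on synthetic data generated with `make_classification`; the data and variable names are illustrative and unrelated to the placement dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the placement data)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# score() is the fraction of correct predictions on the given data
manual_acc = np.mean(model.predict(X_te) == y_te)
print(model.score(X_te, y_te), manual_acc)
```

Both printed numbers are the same, which is why the value from `clf.score` in this step matches the `accuracy_score` result computed at the end of the article.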

Step 9: Once the model is trained, we check it by passing in one set of arbitrary feature values:

Python

# predicting for one arbitrary sample (12 values, in the same column order as X)
clf.predict([[0, 87, 0, 95, 0, 2, 78, 2, 0, 0, 1, 0]])
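
`predict` expects a 2-D input (one row per sample) with the features in the same order as during training, and it returns only the class label. When the confidence behind a prediction matters, `predict_proba` returns the estimated probability of each class. A minimal sketch on synthetic data, with illustrative names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (not the placement data)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

sample = X_syn[:1]                    # 2-D array containing a single row
proba = model.predict_proba(sample)   # shape (1, 2); columns follow model.classes_
pred = model.predict(sample)          # the class with the highest probability

print(model.classes_, proba, pred)
```

The same call on the article's classifier would be `clf.predict_proba(...)` with the 12-value row used above.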

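Step 10: To get a finer-grained view of the model's performance, we build a confusion matrix. For a binary problem such as this one, the confusion matrix is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives.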

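Building the confusion matrix takes two arguments: the actual labels of the test set, Y_test, and the predicted labels. The classifier's predicted labels are stored in Y_pred, as follows: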

Python

# creating a Y_pred for test data
Y_pred = clf.predict(X_test)
 
# display predicted values
Y_pred

Step 11: Now that we have Y_pred, we can generate the confusion matrix and the accuracy:

Python

# evaluation of the classifier
from sklearn.metrics import confusion_matrix, accuracy_score
 
# display confusion matrix
print(confusion_matrix(Y_test, Y_pred))
 
# display accuracy
print(accuracy_score(Y_test, Y_pred))
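
To read the matrix, note that scikit-learn puts the true labels on the rows and the predicted labels on the columns, so for binary labels 0 and 1 the flattened matrix is ordered as TN, FP, FN, TP. The sketch below uses short hand-written label lists (not model output) to show how accuracy, precision, and recall follow from those four counts.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_hat = [1, 0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)                       # 2 1 1 4

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 6 / 8 = 0.75
precision = tp / (tp + fp)                  # 4 / 5 = 0.8
recall = tp / (tp + fn)                     # 4 / 5 = 0.8
print(accuracy, precision, recall)
```

Accuracy alone can look good when one class dominates, which is why the per-class counts in the confusion matrix are worth inspecting alongside it.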
