ML | Logistic Regression vs. Decision Tree Classification
We can compare the two algorithms on the following criteria:

Criteria | Logistic Regression | Decision Tree Classification
Interpretability | Less interpretable | More interpretable
Decision Boundaries | A single, linear decision boundary | Splits the feature space into smaller regions
Ease of Decision Making | A decision threshold has to be set | Automatically handles decision making
Overfitting | Not prone to overfitting | Prone to overfitting
Robustness to Noise | Robust to noise | Strongly affected by noise
Scalability | Requires a large enough training set | Can be trained on a small training set
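As a rough illustration of the interpretability and decision-boundary rows, the sketch below fits both models on a small synthetic two-feature dataset. The dataset (generated with `make_classification`), the feature names `x0`/`x1`, and the `max_depth=2` limit are assumptions made for this example and are not part of the walkthrough that follows. Logistic regression yields one linear boundary described by its coefficients, while the tree yields a set of axis-aligned threshold rules.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data (an assumption, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    random_state=0)

# Logistic regression: a single linear boundary w0*x0 + w1*x1 + b = 0
lr_demo = LogisticRegression().fit(X_demo, y_demo)
print("coefficients:", lr_demo.coef_[0], "intercept:", lr_demo.intercept_[0])

# Decision tree: nested axis-aligned splits, printable as if/else rules
dt_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)
rules = export_text(dt_demo, feature_names=["x0", "x1"])
print(rules)
```

The printed rules show each region of the feature space as a chain of threshold tests, whereas the logistic model is summarized by two weights and an intercept.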
Step 1: Import the required libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
Step 2: Read and clean the dataset
# Change the working directory to the folder that contains the file
import os
os.chdir(r'C:\Users\Dev\Desktop\Kaggle\Sinking Titanic')

df = pd.read_csv('_train.csv')

# Separate the target column from the features
y = df['Survived']
X = df.drop('Survived', axis=1)

# Drop the text columns that are not used as features
X = X.drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)

# Encode the categorical values as integers (integer encoding, not one-hot)
X = X.replace(['male', 'female'], [2, 3])

# Fill missing values by carrying the previous row's value forward
X = X.ffill()
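The replacement in this step maps `male`/`female` to the integers 2 and 3, which is integer (label) encoding. For comparison, here is a minimal sketch of one-hot encoding with `pandas.get_dummies`. The tiny frame `toy` is invented for illustration and is not taken from the Titanic file.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Age": [22.0, 38.0, 26.0]})

# Integer encoding: one column with arbitrary numeric codes
int_encoded = toy.assign(Sex=toy["Sex"].map({"male": 2, "female": 3}))

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(toy, columns=["Sex"], dtype=int)

print(int_encoded)
print(one_hot)
```

For a two-valued column the choice makes little practical difference. With more categories, integer codes impose an ordering that a linear model such as logistic regression will treat as meaningful, which is one reason one-hot encoding is often preferred there.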
Step 3: Train and evaluate the Logistic Regression model
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the model and report its accuracy on the held-out rows
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
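The accuracy printed in this step relies on `predict`, which for a binary problem amounts to a 0.5 cut-off on the predicted probability. To show what "a decision threshold has to be set" can look like in practice, the sketch below applies several cut-offs to `predict_proba` output. It uses synthetic stand-in data from `make_classification` rather than the Titanic file, and the thresholds 0.3, 0.5, and 0.7 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the Titanic file is not used here)
X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(Xs_train, ys_train)

# Probability of the positive class for each test row
proba = clf.predict_proba(Xs_test)[:, 1]

# Count how many rows are labelled positive under each cut-off
positives = {t: int((proba >= t).sum()) for t in (0.3, 0.5, 0.7)}
print(positives)
```

Lowering the threshold can only keep or increase the number of predicted positives, so it trades precision for recall. Which cut-off is appropriate depends on the relative cost of the two kinds of error.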
Step 4: Train and evaluate the Decision Tree Classifier model
# Try both split-quality criteria and record the test accuracy of each
criteria = ['gini', 'entropy']
scores = {}
for c in criteria:
    dt = DecisionTreeClassifier(criterion=c)
    dt.fit(X_train, y_train)
    test_score = dt.score(X_test, y_test)
    scores[c] = test_score
print(scores)
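The comparison table describes decision trees as prone to overfitting. One common way to check this is to compare training and test accuracy, and to see how a depth limit changes the gap. The following is a minimal sketch under stated assumptions: the data is synthetic and noisy (from `make_classification` with `flip_y=0.2`), and `max_depth=3` is an arbitrary illustrative limit, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (an assumption): flip_y=0.2 assigns a random label
# to about 20% of the samples
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xs_train, ys_train)
    results[depth] = (tree.score(Xs_train, ys_train),
                      tree.score(Xs_test, ys_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

An unrestricted tree fits the training rows perfectly, noise included, so a large gap between its training and test accuracy is a typical sign of overfitting. Limiting the depth usually narrows that gap.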