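ML | Multiple Linear Regression (Backward Elimination Technique)

Multiple linear regression is a type of regression in which the model depends on several independent variables (rather than on a single independent variable, as in simple linear regression). There are several techniques for building an effective multiple linear regression model:

All-in
Backward elimination
Forward selection
Bidirectional elimination

In this article, we implement multiple linear regression using the backward elimination technique.
Backward elimination consists of the following steps: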

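a. Select a significance level (SL) for staying in the model (e.g. SL = 0.05).
b. Fit the model with all possible predictors.
c. Consider the predictor with the highest P-value. If P > SL, go to step d; otherwise stop, because the model is final.
d. Remove that predictor.
e. Fit the model without this variable and repeat from step c until the condition becomes false.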

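Suppose we have a dataset containing spending information for a set of different companies. We want to predict each company's profit so that we can decide which company would give the best results to work with. We build the regression model step by step.

Step 1: Basic preprocessing and encoding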

# import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
  
# import the dataset
df = pd.read_csv('50_Startups.csv')
  
# first five entries of the dataset
df.head()
  
# split the dataframe into independent (x) and dependent (y) variables
x = df[['R&D Spend', 'Administration', 'Marketing Spend', 'State']]
y = df['Profit']
x.head()
y.head()

# 'State' is a string column, so one-hot encode it;
# dtype=int keeps the dummy columns numeric for statsmodels later on
x = pd.get_dummies(x, dtype=int)
x.head()

The dataset

The set of independent variables after encoding the State column
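
To make the encoding step concrete, here is a small sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not rows from 50_Startups.csv. It shows what `pd.get_dummies` produces and what the optional `drop_first=True` argument changes.

```python
import pandas as pd

# toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Spend": [10.0, 20.0, 30.0],
    "State": ["New York", "California", "Florida"],
})

# one 0/1 column per category; numeric columns are passed through unchanged
encoded = pd.get_dummies(toy, dtype=int)
print(list(encoded.columns))

# drop_first=True omits the first category, so the remaining dummy
# columns no longer sum to 1 in every row
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)
print(list(reduced.columns))
```

In the full encoding, the dummy columns sum to 1 in every row, which makes them perfectly collinear with a column of ones. Leaving one dummy column out, either with `drop_first=True` or by not selecting it later, avoids that.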

Step 2: Split the data into training and test sets and make predictions

# hold out 30% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size = 0.3, random_state = 0)

# fit an ordinary least-squares model and predict on the test set
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x_train, y_train)
pred = lm.predict(x_test)
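
One way to put a number on how close the predictions are is to compute R-squared and the mean absolute error on the held-out rows. The sketch below uses synthetic data in place of 50_Startups.csv (an assumption, so that it runs on its own); the same two `sklearn.metrics` calls could be applied to `y_test` and `pred` from the code above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the spend columns and profit (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
pred = lm.predict(X_test)

# R-squared: share of test-set variance explained; MAE: average absolute miss
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")
```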

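We can see that our predictions are close enough to the test set, but how do we find the most important factor contributing to profit?
Here is a solution.
We know that the equation of a multiple linear regression line is y = b1 + b2*x + b3*x' + b4*x'' + ...
where b1, b2, b3, ... are the coefficients and x, x', x'' are the independent variables.
Since the first coefficient has no 'x' attached to it, we assume it can be written as the product of b1 and 1, so we append a column of ones. Some libraries handle this automatically, but since we are using the statsmodels library we need to add the column explicitly.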
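
A small numeric sketch of why the column of ones matters, using plain NumPy least squares on made-up data where y = 4 + 2*x exactly. Without the ones column the fit is forced through the origin; with it, the first coefficient recovers the intercept b1.

```python
import numpy as np

# made-up data lying exactly on the line y = 4 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 4.0 + 2.0 * x

# design matrix without and with the column of ones
X_no_const = x.reshape(-1, 1)
X_with_const = np.column_stack([np.ones_like(x), x])

coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
coef_with_const, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

print(coef_no_const)    # a single slope; the line must pass through (0, 0)
print(coef_with_const)  # intercept b1 first, then the slope b2
```

statsmodels also provides `statsmodels.api.add_constant`, which prepends such a column for you.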
Step 3: Using the backward elimination technique

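import statsmodels.regression.linear_model as sm

# prepend a column of ones so the model can estimate the intercept
x = np.append(arr = np.ones((x.shape[0], 1)).astype(int),
              values = x, axis = 1)

# choose a significance level (usually 0.05); if the predictor with the
# highest p-value has p > 0.05, remove that predictor
# columns: 0 = ones, 1-3 = spend columns, 4 onward = State dummies.
# Assuming State has three categories, x has columns 0-6, and the last
# dummy (column 6) is left out to avoid collinearity with the ones column.
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
print(ols.summary())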

This figure shows the parameter with the highest p-value.

Now we follow the backward elimination steps and start removing the unnecessary parameters.

# after each fit, read the P>|t| column of the summary and drop the
# predictor with the highest p-value if it exceeds 0.05

# drop column 4 (a State dummy): highest p-value in the previous summary
x_opt = x[:, [0, 1, 2, 3, 5]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
print(ols.summary())

# drop column 5 (the other State dummy)
x_opt = x[:, [0, 1, 2, 3]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
print(ols.summary())

# drop column 3 (Marketing Spend)
x_opt = x[:, [0, 1, 2]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
print(ols.summary())

# drop column 2 (Administration)
x_opt = x[:, [0, 1]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
print(ols.summary())

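Summary after removing the first unnecessary parameter.

So if we continue this process, we end up with only one column: R&D Spend. We can conclude that, in this dataset, R&D spend is the strongest predictor of profit, i.e. the companies that spend the most on R&D tend to have the highest profit.

This answers our problem statement of finding a company to work with. Now let us look briefly at the parameters of the OLS summary.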

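R-squared: It indicates the goodness of fit. It ranges from 0 to 1, and the closer to 1 the better. It measures the proportion of the variation in the dependent variable that the model explains. However, it is biased in the sense that it never decreases when a variable is added, even an irrelevant one.
Adj. R-squared: This parameter applies a penalty for the number of regressors. Unlike R-squared, it can fall when a variable that adds little is included, and it rises only when the new variable improves the fit by more than the penalty. During backward elimination, if its value keeps increasing as unnecessary parameters are removed, continue; otherwise stop and revert to the previous model.
F-statistic: It is used to compare two variances and is never negative. It is formulated as s1^2 / s2^2, where s1^2 and s2^2 are the two variances. In regression, it is the ratio of the variance explained by the model to the unexplained variance, each divided by its degrees of freedom.
AIC and BIC: AIC stands for Akaike Information Criterion and BIC for Bayesian Information Criterion. Both depend on the likelihood function L, and lower values are preferred.
Skew: It indicates how symmetric the data are about the mean (in the OLS summary, the data are the residuals).
Kurtosis: It measures the shape of the distribution, i.e. how much of the data lies close to the mean compared with far from the mean.
Omnibus: D'Agostino's test. It provides a combined statistical test for the presence of skewness and kurtosis.
Log-likelihood: It is the logarithm of the likelihood function.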
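
Several of these quantities can be reproduced by hand, which makes the definitions less abstract. The sketch below is an illustration on synthetic data using only NumPy. The helper name `ols_metrics` is hypothetical, and the log-likelihood, AIC and BIC lines assume normally distributed errors and count the intercept among the k parameters.

```python
import numpy as np


def ols_metrics(X, y):
    """Hypothetical helper: X must already contain a column of ones."""
    n, k = X.shape          # k parameters, including the intercept
    p = k - 1               # number of predictors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)                 # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    f_stat = (r2 / p) / ((1.0 - r2) / (n - k))
    # Gaussian log-likelihood evaluated at the fitted parameters
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ss_res / n) + 1.0)
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return {"r2": r2, "adj_r2": adj_r2, "f_stat": f_stat,
            "log_lik": log_lik, "aic": aic, "bic": bic}


rng = np.random.default_rng(0)
n = 50
signal = rng.normal(size=n)
junk = rng.normal(size=n)   # unrelated to y
y = 1.0 + 2.0 * signal + rng.normal(scale=0.5, size=n)

base = ols_metrics(np.column_stack([np.ones(n), signal]), y)
extended = ols_metrics(np.column_stack([np.ones(n), signal, junk]), y)

print(base["r2"], extended["r2"])          # R-squared cannot go down
print(base["adj_r2"], extended["adj_r2"])  # adjusted R-squared may go down
```

Comparing `base` with `extended` shows the bias mentioned above: adding the unrelated `junk` column cannot lower R-squared, while adjusted R-squared, AIC and BIC all charge a price for the extra parameter.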

This figure shows the preferred relative values of the parameters.
Reference: Machine Learning course by Kirill Eremenko and Hadelin de Ponteves.