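How to Calculate Studentized Residuals in Python?
A studentized residual is a statistical term defined as the quotient obtained by dividing a residual by its estimated standard deviation. It is a key technique for detecting outliers. As a rule of thumb, any observation in a dataset whose studentized residual is greater than 3 in absolute value is commonly treated as an outlier.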
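To make the definition concrete, the following is a minimal sketch that computes studentized residuals by hand with NumPy for a simple linear regression. The helper name `studentized_residuals` is hypothetical, and the sketch assumes the standard textbook formulas: the internally studentized residual divides each residual by s * sqrt(1 - h_i), and the externally studentized version re-estimates the error variance with the observation left out.

```python
import numpy as np


def studentized_residuals(x, y):
    """Return (internal, external) studentized residuals for y ~ 1 + x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # Design matrix with an intercept column
    X = np.column_stack([np.ones(n), x])
    p = X.shape[1]
    # Hat matrix; its diagonal holds the leverages h_i
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = y - H @ y
    # Residual variance estimate using all observations
    s2 = resid @ resid / (n - p)
    internal = resid / np.sqrt(s2 * (1.0 - h))
    # Closed-form conversion to leave-one-out (external) studentization
    external = internal * np.sqrt((n - p - 1) / (n - p - internal ** 2))
    return internal, external


benchmark = [27, 28, 18, 18, 29, 30, 25, 25, 24, 29]
score = [80, 95, 80, 78, 84, 96, 86, 75, 97, 89]
internal, external = studentized_residuals(benchmark, score)
print(np.round(external, 6))
```

The externally studentized values are the ones that should correspond to what statsmodels reports in the steps below.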
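The following Python libraries should already be installed on the system:
- pandas
- numpy
- statsmodels
- matplotlib
You can install these packages by running the following command in a terminal.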
pip3 install pandas numpy statsmodels matplotlib
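After installation, a quick check along these lines can confirm that the packages are visible to the interpreter. This is only a sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


required = ["pandas", "numpy", "statsmodels", "matplotlib"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'not installed'}")
```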
Steps to calculate studentized residuals in Python
Step 1: Import the libraries.
We need to import the libraries installed above into our program.
Python3
# Importing necessary packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
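Step 2: Create a dataframe.
First, we need to create a dataframe. The pandas package lets us build one directly from a dictionary. The snippet is given below.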
Python3
# Creating dataframe
dataframe = pd.DataFrame({'Score': [80, 95, 80, 78, 84,
                                    96, 86, 75, 97, 89],
                          'Benchmark': [27, 28, 18, 18, 29, 30,
                                        25, 25, 24, 29]})
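Step 3: Build a simple linear regression model.
Now we need to build a simple linear regression model for the dataset we created. To fit a simple linear regression model, the statsmodels package provides the ols() function in its formula interface.
Syntax:
statsmodels.formula.api.ols(formula, data)
Parameters:
- formula : A string of the form 'y ~ x', where y is the dependent (response) variable and x is the independent (predictor) variable
- data : The dataframe that contains the columns named in the formula
Example: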
Python3
# Building simple linear regression model
simple_regression_model = ols('Score ~ Benchmark', data=dataframe).fit()
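Step 4: Produce the studentized residuals.
To generate a dataframe containing the studentized residual for each observation in the dataset, we can call the outlier_test() method on the fitted model.
Syntax:
simple_regression_model.outlier_test()
This method returns a dataframe that contains the studentized residual for each observation in the dataset.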
Python3
# Producing studentized residuals
stud_res = simple_regression_model.outlier_test()
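As an illustration of how the result can be used, the sketch below applies the rule of thumb from the introduction and keeps only the rows whose studentized residual exceeds 3 in absolute value. The `threshold` variable is an illustrative choice, not something prescribed by statsmodels.

```python
import pandas as pd
from statsmodels.formula.api import ols

dataframe = pd.DataFrame({'Score': [80, 95, 80, 78, 84,
                                    96, 86, 75, 97, 89],
                          'Benchmark': [27, 28, 18, 18, 29, 30,
                                        25, 25, 24, 29]})
model = ols('Score ~ Benchmark', data=dataframe).fit()
result = model.outlier_test()

# Column names of the dataframe returned by outlier_test()
print(list(result.columns))

# Rule-of-thumb cutoff (assumption: |studentized residual| > 3 flags an outlier)
threshold = 3
outliers = result[result['student_resid'].abs() > threshold]
print(f"{len(outliers)} observation(s) flagged")
print(outliers)
```

An empty dataframe here simply means that no observation crosses the chosen cutoff.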
The complete implementation is given below.
Python3
# Python program to calculate studentized residuals
# Importing necessary packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
# Creating dataframe
dataframe = pd.DataFrame({'Score': [80, 95, 80, 78, 84,
                                    96, 86, 75, 97, 89],
                          'Benchmark': [27, 28, 18, 18, 29, 30,
                                        25, 25, 24, 29]})
# Building simple linear regression model
simple_regression_model = ols('Score ~ Benchmark', data=dataframe).fit()
# Producing studentized residuals
result = simple_regression_model.outlier_test()
print(result)
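Output:
The output is a dataframe that contains:
- The studentized residuals (column student_resid)
- The unadjusted p-values of the studentized residuals (column unadj_p)
- The Bonferroni-corrected p-values of the studentized residuals (column bonf(p))
We can see that the studentized residual for the first observation in the dataset is -1.121201, the studentized residual for the second observation is 0.954871, and so on.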
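As a cross-check, the same values can be read from the influence diagnostics of the fitted model. The sketch below assumes the `get_influence()` method of the statsmodels results object and its `resid_studentized_external` and `resid_studentized_internal` attributes. It suggests that outlier_test() reports the externally studentized residuals.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

dataframe = pd.DataFrame({'Score': [80, 95, 80, 78, 84,
                                    96, 86, 75, 97, 89],
                          'Benchmark': [27, 28, 18, 18, 29, 30,
                                        25, 25, 24, 29]})
model = ols('Score ~ Benchmark', data=dataframe).fit()
result = model.outlier_test()

# Influence diagnostics expose both flavours of studentized residuals
influence = model.get_influence()
external = influence.resid_studentized_external
internal = influence.resid_studentized_internal

# Compare the outlier_test() column with the external residuals
matches_external = np.allclose(result['student_resid'].values, external)
print("outlier_test matches external residuals:", matches_external)
print(np.round(internal, 6))
```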
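Visualization:
Now let us visualize the studentized residuals. With the help of matplotlib, we can plot the predictor variable values against the corresponding studentized residuals.
Example: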
Python3
# Python program to plot the studentized residuals
# Importing necessary packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
# Creating dataframe
dataframe = pd.DataFrame({'Score': [80, 95, 80, 78, 84,
                                    96, 86, 75, 97, 89],
                          'Benchmark': [27, 28, 18, 18, 29, 30,
                                        25, 25, 24, 29]})
# Building simple linear regression model
simple_regression_model = ols('Score ~ Benchmark', data=dataframe).fit()
# Producing studentized residuals
result = simple_regression_model.outlier_test()
# Defining predictor variable values and
# studentized residuals
x = dataframe['Benchmark']
y = result['student_resid']
# Creating a scatterplot of predictor variable
# vs studentized residuals
plt.scatter(x, y)
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Benchmark')
plt.ylabel('Studentized Residuals')
# Save the plot
plt.savefig("Plot.png")
Output:
Plot.png: a scatterplot of the studentized residuals with a dashed horizontal line at 0.
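A variation on the plot adds reference lines at +3 and -3, so that observations beyond the rule-of-thumb cutoff stand out. This is a sketch under a few assumptions: the non-interactive "Agg" backend is selected so that the script runs without a display, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file, no display needed
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.formula.api import ols

dataframe = pd.DataFrame({'Score': [80, 95, 80, 78, 84,
                                    96, 86, 75, 97, 89],
                          'Benchmark': [27, 28, 18, 18, 29, 30,
                                        25, 25, 24, 29]})
model = ols('Score ~ Benchmark', data=dataframe).fit()
result = model.outlier_test()

threshold = 3  # rule-of-thumb cutoff for outliers
fig, ax = plt.subplots()
ax.scatter(dataframe['Benchmark'], result['student_resid'])
ax.axhline(0, color='black', linestyle='--')
# Dotted lines mark the +/- threshold band
for level in (-threshold, threshold):
    ax.axhline(level, color='red', linestyle=':')
ax.set_xlabel('Benchmark')
ax.set_ylabel('Studentized Residuals')

output_path = "studentized_residuals_threshold.png"  # hypothetical file name
fig.savefig(output_path)
plt.close(fig)
```

Points that fall outside the dotted band would be the candidates for closer inspection.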