Shapiro-Wilk Test in R Programming

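The Shapiro-Wilk test, or Shapiro test, is a normality test in frequentist statistics. Its null hypothesis is that the population is normally distributed. It is one of three normality tests designed to detect various departures from normality. If the p-value is less than or equal to 0.05, the test rejects the hypothesis of normality: the data show a statistically significant departure from a normal distribution at the 5% significance level. If the p-value is greater than 0.05, the test only indicates that no significant departure from normality was found; it does not prove that the data are normal. The test is easy to run in R.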

Shapiro-Wilk Test Formula

Suppose a sample x₁, x₂, …, x_n comes from a normally distributed population. The Shapiro-Wilk statistic used to test this null hypothesis is

$W=\frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$

where,

x_(i): the i-th smallest value in the given sample (the i-th order statistic).
x̄: the sample mean, (x₁ + x₂ + … + x_n) / n.
a_i: coefficients given by (a₁, a₂, …, a_n) = (m^T V^-1) / C. Here m = (m₁, m₂, …, m_n)^T is the vector of expected values of the order statistics of n independent standard normal variables, V is the covariance matrix of those order statistics, and C = || V^-1 m || is a vector norm.
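
The formula can be made concrete with a rough sketch. Computing the exact coefficients a_i requires the covariance matrix V, which is not available in closed form, so the code below swaps in the simpler Shapiro-Francia approximation: V is treated as the identity matrix and m is approximated with Blom's formula. Both choices are assumptions made for illustration, and the result is only expected to be close to the W reported by shapiro.test(), not equal to it.

```r
# Shapiro-Francia style approximation of W (NOT the exact Shapiro-Wilk algorithm).
# Assumption: V is replaced by the identity matrix, so a = m / ||m||.
x <- ToothGrowth$len
n <- length(x)

# Blom's approximation to the expected standard normal order statistics m_i
m <- qnorm((seq_len(n) - 0.375) / (n + 0.25))

# Normalised coefficients, so that sum(a^2) equals 1
a <- m / sqrt(sum(m^2))

# Numerator: squared weighted sum of the ordered sample
# Denominator: sum of squared deviations from the sample mean
w_approx <- sum(a * sort(x))^2 / sum((x - mean(x))^2)

# W as computed by R's built-in test, for comparison
w_builtin <- unname(shapiro.test(x)$statistic)

print(c(approx = w_approx, builtin = w_builtin))
```

Values of W close to 1 indicate that the ordered sample lines up well with the expected normal order statistics.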

Implementation in R

To perform the Shapiro-Wilk test, R provides the shapiro.test() function in the built-in stats package.

Syntax:

shapiro.test(x)

Parameter:

x : a numeric vector of data values. Missing values are allowed, but the number of non-missing values must be between 3 and 5000.
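
The constraint on x can be illustrated with a small sketch. The vectors below are made-up values used only for demonstration: the first call succeeds because five non-missing values remain after the NA is dropped, while the second call is expected to stop with an error because only two non-missing values remain.

```r
# Made-up values for illustration only
x_ok <- c(2.1, 3.4, NA, 4.8, 5.0, 3.9)

# The NA is dropped internally; 5 non-missing values are tested
res_ok <- shapiro.test(x_ok)
print(res_ok)

# Only 2 non-missing values: shapiro.test() raises an error,
# which is captured here as a message string instead of stopping the script
res_small <- tryCatch(
  shapiro.test(c(1.2, NA, 3.4)),
  error = function(e) conditionMessage(e)
)
print(res_small)
```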

Let us walk through the Shapiro-Wilk test step by step.

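Step 1: First install the required package. shapiro.test() itself ships with base R (the stats package), so the only additional package used here is dplyr, which provides efficient data manipulation and the sample_n() function. The package can be installed from the R console as follows: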

install.packages("dplyr")

Step 2: Now load the installed package into the R script. This is done with the library() function as follows.

R

# loading the package
library(dplyr)

Step 3: The most important task is to choose a suitable data set. Here we use the ToothGrowth data set, which is built into R.

R

# loading the package
library("dplyr")
  
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth

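You can also use your own data set. To do so, first prepare the data, save it to a file, and then import the file into the script. The file can be read with the following syntax: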

data <- read.delim(file.choose())   # if the file is tab-delimited text (.txt)
data <- read.csv(file.choose())     # if the file is comma-separated (.csv)
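
Because file.choose() opens an interactive dialog, a script that runs unattended usually passes a file path instead. The sketch below is only an illustration: it writes ToothGrowth to a temporary CSV file (a stand-in for your own file) and reads it back, and the names csv_path and result are hypothetical.

```r
# Stand-in for a user-supplied file: write a built-in data set to a temporary CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(ToothGrowth, csv_path, row.names = FALSE)

# Non-interactive import: pass the path directly instead of file.choose()
data <- read.csv(csv_path)

# Check the structure before testing, so the column is known to be numeric
str(data)

# Run the test on the numeric column of interest
result <- shapiro.test(data$len)
print(result)
```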

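Step 4: Now call set.seed() to fix the seed of the random number generator, so that the random sample is reproducible. Then display 10 randomly selected rows using the sample_n() function from the dplyr package. This is a quick way to inspect the data.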

R

# loading the package
library("dplyr")
  
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
  
# Using the set.seed() for 
# random number generation
set.seed(1234)
  
# Using the sample_n() for 
# random sample of 10 rows
dplyr::sample_n(my_data, 10)

Output:

    len supp dose
1  11.2   VC  0.5
2   8.2   OJ  0.5
3  10.0   OJ  0.5
4  27.3   OJ  2.0
5  14.5   OJ  1.0
6  26.4   OJ  2.0
7   4.2   VC  0.5
8  15.2   VC  1.0
9  14.5   OJ  0.5
10 26.7   VC  2.0
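
The role of set.seed() can be checked with a short sketch that uses only base R, so it runs without dplyr. This is an illustrative alternative to sample_n(), not the method used above, and the rows it draws need not match the output shown. The names first_draw and second_draw are hypothetical.

```r
my_data <- ToothGrowth

# Draw 10 random rows after fixing the seed
set.seed(1234)
first_draw <- my_data[sample(nrow(my_data), 10), ]

# Resetting the same seed reproduces exactly the same rows
set.seed(1234)
second_draw <- my_data[sample(nrow(my_data), 10), ]

print(first_draw)
print(identical(first_draw, second_draw))
```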

Step 5: Finally, perform the Shapiro-Wilk test with the shapiro.test() function.

R

# loading the package
library("dplyr")
  
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
  
# Using the set.seed() 
# for random number generation
set.seed(1234)
  
# Using the sample_n() 
# for random sample of 10 rows
dplyr::sample_n(my_data, 10)
  
# Using the shapiro.test() to check
# the len column of the data set
# for normality
shapiro.test(my_data$len)

Output:

> dplyr::sample_n(my_data, 10)
    len supp dose
1  11.2   VC  0.5
2   8.2   OJ  0.5
3  10.0   OJ  0.5
4  27.3   OJ  2.0
5  14.5   OJ  1.0
6  26.4   OJ  2.0
7   4.2   VC  0.5
8  15.2   VC  1.0
9  14.5   OJ  0.5
10 26.7   VC  2.0
> shapiro.test(my_data$len)

    Shapiro-Wilk normality test

data:  my_data$len
W = 0.96743, p-value = 0.1091

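From the output, the p-value (0.1091) is greater than 0.05, so we fail to reject the null hypothesis of normality. In other words, the distribution of the data does not differ significantly from a normal distribution, and it is reasonable to assume normality.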
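
The same decision can be made in code rather than by reading the printed output. The sketch below assumes the conventional 0.05 threshold; alpha, w_stat, p_val and decision are illustrative names, not part of the shapiro.test() interface.

```r
# shapiro.test() returns a list-like "htest" object
result <- shapiro.test(ToothGrowth$len)

alpha  <- 0.05                       # assumed significance level
w_stat <- unname(result$statistic)   # the W statistic
p_val  <- result$p.value             # the p-value

# Fail to reject normality when the p-value exceeds alpha
decision <- if (p_val > alpha) {
  "no significant departure from normality"
} else {
  "significant departure from normality"
}

cat("W =", round(w_stat, 5), " p-value =", round(p_val, 4), "\n")
cat("Decision:", decision, "\n")
```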