Bootstrapping in R Programming

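Bootstrapping is a technique used in inferential statistics that repeatedly draws random samples, with replacement, from a single dataset. It allows measures such as the mean, median, mode, and confidence intervals to be computed over those resamples.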

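The bootstrapping process in R is as follows:

Select the number of bootstrap samples.
Select the size of each sample.
For each sample, while it holds fewer observations than the chosen size, draw an observation at random from the dataset (with replacement) and add it to the sample.
Measure the statistic on the sample.
Compute the mean of the statistic over all samples.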
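
The steps above can be sketched in base R without any package. This is only a minimal illustration, assuming the statistic of interest is the mean of `mtcars$mpg`; the names `n_boot`, `sample_size`, and `boot_means` are made up for the example.

```r
# Minimal sketch of the bootstrap steps using base R only
set.seed(42)                      # assumption: fixed seed so the run is repeatable

x <- mtcars$mpg                   # the single dataset being resampled
n_boot <- 1000                    # step 1: number of bootstrap samples
sample_size <- length(x)          # step 2: size of each sample

# Steps 3 and 4: draw each sample with replacement and compute its statistic
boot_means <- replicate(n_boot, {
  resample <- sample(x, size = sample_size, replace = TRUE)
  mean(resample)
})

# Step 5: average the statistic over all bootstrap samples
mean(boot_means)

# The spread of the replicates estimates the standard error of the mean
sd(boot_means)
```

Drawing the whole sample in one `sample()` call is equivalent to adding one random observation at a time until the chosen size is reached.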

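Methods of Bootstrapping

There are 2 methods of bootstrapping:

Residual Resampling: This method is also called model-based resampling. It assumes that the model is correct and that the errors are independent and identically distributed. In each round, the residuals are resampled and added to the fitted values to form a new dependent variable, and the model is refitted to it.
Bootstrap Pairs: In this method, the dependent and independent variables are sampled together as pairs.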
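
The difference between the two methods is easier to see side by side. The sketch below assumes a simple linear model `mpg ~ wt` on `mtcars`, chosen only for illustration; `resid_slope` and `pairs_slope` are hypothetical names.

```r
# Sketch: residual resampling vs. pairs resampling for a linear model
set.seed(1)

fit <- lm(mpg ~ wt, data = mtcars)   # assumed model, for illustration only
n <- nrow(mtcars)

# Residual (model-based) resampling: keep wt fixed, rebuild the response
# from the fitted values plus resampled residuals, then refit the model
resid_slope <- replicate(500, {
  d <- mtcars
  d$mpg <- fitted(fit) + sample(resid(fit), n, replace = TRUE)
  coef(lm(mpg ~ wt, data = d))[["wt"]]
})

# Pairs resampling: resample whole rows so each (mpg, wt) pair stays together
pairs_slope <- replicate(500, {
  idx <- sample(n, n, replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))[["wt"]]
})

# Bootstrap standard error of the slope under each scheme
c(residual = sd(resid_slope), pairs = sd(pairs_slope))
```

Residual resampling leans on the model being correct, while pairs resampling makes no such assumption, so the two standard errors can differ when the model fits poorly.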

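Types of Confidence Intervals in Bootstrapping

A confidence interval (CI) is a range of values computed from sample data. It gives an interval that is expected to contain the true value at a stated confidence level, not with certainty. There are 5 types of confidence intervals in bootstrapping, as follows:

Basic: Also known as the reverse percentile interval, it is generated using quantiles of the bootstrap distribution. Mathematically,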

$\left(2 \widehat{\theta}-\theta_{(1-\alpha / 2)}^{*}, 2 \widehat{\theta}-\theta_{(\alpha / 2)}^{*}\right)$

where,
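$\alpha$ represents the significance level, so the confidence level is $1-\alpha$; mostly $\alpha = 0.05$ for a 95% interval
$\widehat{\theta}$ represents the estimate computed on the original data
$\theta^{*}$ represents bootstrapped coefficients
$\theta_{(1-\alpha / 2)}^{*}$ represents the $1-\alpha / 2$ quantile of the bootstrapped coefficients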
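
As a rough check of the basic interval formula, it can be computed by hand in base R. The sketch assumes the statistic is the mean of `mtcars$mpg` and uses `quantile()` for the bootstrap quantiles; `theta_hat`, `theta_star`, and `basic_ci` are illustrative names.

```r
# Sketch: basic (reverse percentile) interval computed by hand
set.seed(123)

x <- mtcars$mpg
alpha <- 0.05                                  # 95% confidence level
theta_hat <- mean(x)                           # estimate on the original data

# Bootstrap replicates of the estimate
theta_star <- replicate(2000, mean(sample(x, replace = TRUE)))

# Lower and upper bootstrap quantiles
q <- quantile(theta_star, c(alpha / 2, 1 - alpha / 2), names = FALSE)

# The basic interval reflects the quantiles around the original estimate
basic_ci <- c(lower = 2 * theta_hat - q[2], upper = 2 * theta_hat - q[1])
basic_ci
```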

Normal: The normal CI is given mathematically as,

$\begin{array}{c} t_{0}-b \pm Z_{\alpha} \cdot \mathrm{se}^{*} \\ 2 t_{0}-t^{*} \pm Z_{\alpha} \cdot \mathrm{se}^{*} \end{array}$
where,

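$t_{0}$ represents the statistic computed on the original dataset
b is the bias of the bootstrap estimate, i.e.,

$\mathbf{b}=\mathbf{t}^{*}-\mathbf{t}_{\mathrm{o}}$
where $t^{*}$ is the mean of the bootstrap replicates
$Z_{\alpha}$ represents the $1-\alpha / 2$ quantile of the standard normal distribution
$se^{*}$ represents the standard error of $t^{*}$, estimated by the standard deviation of the bootstrap replicates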
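
The normal interval can also be reproduced by hand. The following is a minimal sketch, assuming the statistic is the mean of `mtcars$mpg`, that `qnorm()` supplies the normal quantile, and that the standard deviation of the replicates stands in for the standard error; the variable names are illustrative.

```r
# Sketch: normal bootstrap interval computed by hand
set.seed(2024)

x <- mtcars$mpg
alpha <- 0.05
t0 <- mean(x)                                   # statistic on the original data

t_star <- replicate(2000, mean(sample(x, replace = TRUE)))

b <- mean(t_star) - t0                          # bootstrap estimate of bias
se_star <- sd(t_star)                           # bootstrap standard error
z <- qnorm(1 - alpha / 2)                       # standard normal quantile

# Bias-corrected centre (t0 - b) plus or minus z standard errors
normal_ci <- c(lower = t0 - b - z * se_star,
               upper = t0 - b + z * se_star)
normal_ci
```

Because `b` is the mean of the replicates minus `t0`, the centre `t0 - b` equals `2 * t0 - mean(t_star)`, which is why the formula can be written in either form.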

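Stud: In the studentized CI, the bootstrap estimates are standardized to be centered at 0 with a standard deviation of 1, which corrects for skew in the distribution.
Perc: The percentile CI is similar to the basic CI but uses the bootstrap quantiles directly,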

$\left(\theta_{(\alpha / 2)}^{*}, \theta_{(1-\alpha / 2)}^{*}\right)$
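
To see how the percentile and basic intervals relate, both can be requested from `boot.ci()` for the same bootstrap object. This sketch assumes the `boot` package is installed and uses a hypothetical helper `mean_stat`; the `percent` and `basic` components each hold the lower and upper limits in columns 4 and 5.

```r
# Sketch: percentile vs. basic interval from the same bootstrap replicates
library(boot)
set.seed(123)

# Hypothetical statistic: mean of the resampled values
mean_stat <- function(data, i) mean(data[i])

b <- boot(mtcars$mpg, mean_stat, R = 2000)
ci <- boot.ci(b, conf = 0.95, type = c("basic", "perc"))

perc  <- ci$percent[4:5]   # lower and upper percentile limits
basic <- ci$basic[4:5]     # lower and upper basic limits

perc
basic

# The basic limits mirror the percentile limits around the original estimate
c(2 * b$t0 - perc[2], 2 * b$t0 - perc[1])
```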

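BCa: The bias-corrected and accelerated interval adjusts for both bias and skewness, but it may be unstable when outliers are extreme. It is a percentile interval taken at adjusted quantile levels. Mathematically,

$\left(\theta_{(\alpha_{1})}^{*}, \theta_{(\alpha_{2})}^{*}\right)$, with $\alpha_{1}=\Phi\left(z_{0}+\frac{z_{0}+z_{\alpha / 2}}{1-a\left(z_{0}+z_{\alpha / 2}\right)}\right)$ and $\alpha_{2}=\Phi\left(z_{0}+\frac{z_{0}+z_{1-\alpha / 2}}{1-a\left(z_{0}+z_{1-\alpha / 2}\right)}\right)$

where,
$\Phi$ represents the standard normal distribution function and $z_{p}$ its $p$ quantile
$z_{0}$ represents the bias-correction constant
$a$ represents the acceleration constant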
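
One common formulation of BCa takes the bias correction from the share of replicates below the original estimate and the acceleration from jackknife estimates. The sketch below follows that formulation for the mean of `mtcars$mpg`; it is an illustration under those assumptions, not necessarily the exact computation inside `boot.ci()`, and all variable names are made up.

```r
# Sketch: BCa interval computed by hand (bias correction + jackknife acceleration)
set.seed(7)

x <- mtcars$mpg
alpha <- 0.05
theta_hat <- mean(x)
theta_star <- replicate(5000, mean(sample(x, replace = TRUE)))

# Bias correction: normal quantile of the share of replicates below the estimate
z0 <- qnorm(mean(theta_star < theta_hat))

# Acceleration: skewness-like measure of the leave-one-out estimates
jack <- sapply(seq_along(x), function(i) mean(x[-i]))
d <- mean(jack) - jack
a <- sum(d^3) / (6 * sum(d^2)^1.5)

# Adjusted quantile levels replace alpha/2 and 1 - alpha/2
z <- qnorm(c(alpha / 2, 1 - alpha / 2))
adj_levels <- pnorm(z0 + (z0 + z) / (1 - a * (z0 + z)))

bca_ci <- quantile(theta_star, adj_levels, names = FALSE)
bca_ci
```

When `z0` and `a` are both zero, the adjusted levels reduce to `alpha / 2` and `1 - alpha / 2`, and the result is the plain percentile interval.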

The syntax to perform bootstrapping in R programming is as follows:

Syntax: boot(data, statistic, R)

Parameters:

data represents the dataset
statistic represents a function that computes the statistic(s) of interest; it receives the data and a vector of resampled indices
R represents the number of bootstrap replicates
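
A minimal call might look like the sketch below. It assumes the `boot` package is installed; `median_mpg` is a hypothetical statistic function written for this example.

```r
# Sketch: smallest useful boot() call
library(boot)
set.seed(10)

# The statistic function takes the data and an index vector i;
# boot() supplies a freshly resampled i for every replicate
median_mpg <- function(data, i) median(data$mpg[i])

res <- boot(data = mtcars, statistic = median_mpg, R = 500)

res$t0        # statistic computed on the original data
dim(res$t)    # one row per bootstrap replicate, one column per statistic
```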

To learn more about the optional arguments of the boot() function, use the command below:

help("boot")

Example:

R

# Library required for boot() function
install.packages("boot")
 
# Load the library
library(boot)
 
# Statistic function for boot(): receives the data and the resampled row indices i
bootFunc <- function(data, i){
  df <- data[i, ]
  c(cor(df[, 2], df[, 3]),   # correlation between cyl and disp
    median(df[, 2]),         # median of cyl
    mean(df[, 1])            # mean of mpg
  )
}
 
b <- boot(mtcars, bootFunc, R = 100)
 
print(b)
 
# Show all CI values
boot.ci(b, index = 1)

Output:

ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = bootFunc, R = 100)


Bootstrap Statistics :
      original       bias    std. error
t1*  0.9020329 -0.002195625  0.02104139
t2*  6.0000000  0.340000000  0.85540468
t3* 20.0906250 -0.110812500  0.96052824


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
boot.ci(boot.out = b, index = 1)

Intervals : 
Level      Normal              Basic         
95%   ( 0.8592,  0.9375 )   ( 0.8612,  0.9507 )  

Level     Percentile            BCa          
95%   ( 0.8534,  0.9429 )   ( 0.8279,  0.9280 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
Some percentile intervals may be unstable
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable
Warning messages:
1: In boot.ci(b, index = 1) :
  bootstrap variances needed for studentized intervals
2: In norm.inter(t, adj.alpha) :
  extreme order statistics used as endpoints
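
The first warning above appears because studentized intervals need a variance estimate for every replicate. One way to supply it, sketched here under the assumption that the statistic is a simple mean, is to have the statistic function return the estimate followed by its variance; `mean_with_var` is a hypothetical helper.

```r
# Sketch: studentized interval, with the statistic also returning its variance
library(boot)
set.seed(99)

# First element: the estimate; second element: its estimated variance
mean_with_var <- function(data, i) {
  x <- data[i]
  c(mean(x), var(x) / length(x))
}

b_stud <- boot(mtcars$mpg, mean_with_var, R = 999)

# With the default index, boot.ci() reads the variance from the second element
ci_stud <- boot.ci(b_stud, type = "stud")
ci_stud$student[4:5]   # lower and upper studentized limits
```

For statistics without a simple variance formula, the variance usually has to be estimated some other way, for example with a nested bootstrap, which is considerably more expensive.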