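How to Calculate Mahalanobis Distance in R?
In this article, we will calculate the Mahalanobis distance in the R programming language.
The Mahalanobis distance measures how far a point or vector lies from another point, or from the center of a distribution, in a multivariate space, that is, in an analysis involving several variables. Unlike the Euclidean distance, it takes the variance of each variable and the correlations between variables into account. First, we need a data frame.
Example: Creating the data frame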
R
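set.seed(700)  # make the random draws reproducible
# four score variables with 20 observations each: rnorm(n, mean, sd)
score_1 <- rnorm(20, 12, 1)
score_2 <- rnorm(20, 11, 12)
score_3 <- rnorm(20, 15, 23)
score_4 <- rnorm(20, 16, 3)
df <- data.frame(score_1, score_2, score_3, score_4)
df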
Output:
score_1 score_2 score_3 score_4
1 11.91218 20.3843568 68.179655 12.864159
2 11.77103 13.5718323 -30.953642 15.241168
3 11.91570 29.9250800 42.570528 7.179686
4 10.25905 10.7594514 17.879960 19.639647
5 13.01343 15.7463448 3.185857 12.776482
6 11.78211 14.9688992 31.368892 16.043620
7 13.51328 10.5017826 58.985715 14.701817
8 11.10565 20.4965614 6.806652 15.876947
9 11.20834 12.7588547 10.461229 16.991393
10 11.10233 -10.3961351 18.082209 15.258644
11 12.34732 -0.8615359 57.411750 13.400421
12 12.08361 15.0248600 -17.853098 13.999682
13 12.86457 -6.1221908 23.184838 20.389762
14 10.58871 17.1000715 20.900155 12.560962
15 10.74134 6.3728076 39.173259 17.865589
16 11.20248 8.8909128 24.696939 14.384012
17 12.89797 34.8522136 10.035498 14.975053
18 11.37993 14.4232355 28.129197 16.395271
19 11.78309 14.9324201 23.584362 14.765245
20 12.77480 30.7969171 -9.635902 10.203178
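The mahalanobis() function is used to calculate the Mahalanobis distance in R. It is a built-in function (part of the stats package) and returns the squared Mahalanobis distance of each row from the given center.
Syntax: mahalanobis(x, center, cov)
where:
- x: matrix, data frame, or vector of data
- center: mean vector of the distribution
- cov: covariance matrix of the distribution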
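For a row x, the returned value is the squared distance D^2 = (x - mu)^T S^{-1} (x - mu), where mu is the center and S is the covariance matrix. The sketch below recomputes this by hand on a small made-up matrix and compares the result with mahalanobis(). The object names (m, mu, S, centered, manual, builtin) and the data values are assumptions chosen only for illustration; they are not part of the article's example.

```r
# Small illustrative data set (made-up values, not the article's df)
m <- cbind(a = c(2, 4, 6, 8, 11),
           b = c(1, 3, 2, 7, 5))

mu <- colMeans(m)   # center: vector of column means
S  <- cov(m)        # sample covariance matrix

# Squared Mahalanobis distance of each row, computed by hand:
# (x - mu)^T S^{-1} (x - mu)
centered <- sweep(m, 2, mu)   # subtract mu from every row
manual   <- rowSums((centered %*% solve(S)) * centered)

builtin <- mahalanobis(m, mu, S)

# The two results agree, so mahalanobis() returns the squared distance
all.equal(unname(manual), unname(builtin))

# Take the square root if the distance itself is needed
sqrt(builtin)
```

If the unsquared distance is required, wrapping the call in sqrt() as in the last line is enough.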
Example: Calculating the Mahalanobis distance
R
mahalanobis(df, colMeans(df), cov(df))
Output:
4.46866714558536 4.61260586529474 7.41513071619846 5.21448589688871
2.84292222223026 0.673116763926688 6.04984394951585 1.72865361097932
1.03750690527476 7.21856549018804 4.85579110162481 2.90808365141091
7.57223884458172 3.27702692226183 2.68208130355785 0.916110244005359
6.79796970070888 0.829693729587342 0.0356208551487593 4.86388508103035
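Calculating the Mahalanobis distance for each row
Looking at the Mahalanobis distances, we can see that some are much larger than others. To determine whether any of them is statistically significant, we need to calculate p-values. Before doing so, we store the distance of each row in a new column of the data frame.
Example: Calculating the Mahalanobis distance for each row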
R
# create new column for Mahalanobis distances
df$mahalnobis<- mahalanobis(df, colMeans(df), cov(df))
df
Output:
score_1 score_2 score_3 score_4 mahalnobis
1 11.91218 20.3843568 68.179655 12.864159 4.46866715
2 11.77103 13.5718323 -30.953642 15.241168 4.61260587
3 11.91570 29.9250800 42.570528 7.179686 7.41513072
4 10.25905 10.7594514 17.879960 19.639647 5.21448590
5 13.01343 15.7463448 3.185857 12.776482 2.84292222
6 11.78211 14.9688992 31.368892 16.043620 0.67311676
7 13.51328 10.5017826 58.985715 14.701817 6.04984395
8 11.10565 20.4965614 6.806652 15.876947 1.72865361
9 11.20834 12.7588547 10.461229 16.991393 1.03750691
10 11.10233 -10.3961351 18.082209 15.258644 7.21856549
11 12.34732 -0.8615359 57.411750 13.400421 4.85579110
12 12.08361 15.0248600 -17.853098 13.999682 2.90808365
13 12.86457 -6.1221908 23.184838 20.389762 7.57223884
14 10.58871 17.1000715 20.900155 12.560962 3.27702692
15 10.74134 6.3728076 39.173259 17.865589 2.68208130
16 11.20248 8.8909128 24.696939 14.384012 0.91611024
17 12.89797 34.8522136 10.035498 14.975053 6.79796970
18 11.37993 14.4232355 28.129197 16.395271 0.82969373
19 11.78309 14.9324201 23.584362 14.765245 0.03562086
20 12.77480 30.7969171 -9.635902 10.203178 4.86388508
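Because some rows lie much farther from the center than others, it can help to rank them by distance. The following is a minimal sketch on a small made-up data frame in which the last row is deliberately placed far from the rest; the names toy and d2 are hypothetical and the values are not taken from the article's data.

```r
# Made-up data: the last row is deliberately far from the others
toy <- data.frame(x = c(1.0, 1.2, 0.9, 1.1, 1.0, 5.0),
                  y = c(2.0, 2.1, 1.9, 2.2, 2.0, -3.0))

# Squared Mahalanobis distance of every row from the column means
toy$d2 <- mahalanobis(toy, colMeans(toy), cov(toy))

# Rank rows from farthest to nearest
toy[order(toy$d2, decreasing = TRUE), ]

# Index of the row that lies farthest from the center
which.max(toy$d2)
```

The same order() call can be applied to the distance column of any data frame built as in the example above.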
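Calculating the p-value
The p-value of each distance is obtained by treating the squared Mahalanobis distance as a chi-square statistic with k - 1 degrees of freedom, where k is the number of variables. Here k = 4, so we use 3 degrees of freedom. The p-value is the upper-tail probability, that is, the probability of observing a distance at least as large.
The pchisq() function computes the cumulative distribution function of the chi-square distribution.
Syntax: pchisq(q, df, lower.tail = TRUE)
Parameters:
- q: vector of quantiles (here, the distances)
- df: degrees of freedom
- lower.tail: if TRUE (the default), returns P(X <= q); if FALSE, returns P(X > q)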
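By default pchisq() returns the lower-tail probability; the upper tail, which is what a p-value for a large distance needs, is requested with lower.tail = FALSE. The short sketch below shows the difference for an arbitrary value q, which is an assumption chosen only for illustration.

```r
q <- 7.5   # arbitrary squared distance, chosen only for illustration

lower <- pchisq(q, df = 3)                      # P(X <= q)
upper <- pchisq(q, df = 3, lower.tail = FALSE)  # P(X >  q), the p-value

# The two tails always sum to 1
all.equal(lower + upper, 1)

# Sanity check: at the 95th percentile of the chi-square distribution
# the upper-tail probability is 0.05
pchisq(qchisq(0.95, df = 3), df = 3, lower.tail = FALSE)
```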
Example: Calculating the p-value
R
# create new column for the p-value (upper-tail chi-square probability)
df$pvalue <- pchisq(df$mahalnobis, df = 3, lower.tail = FALSE)
df
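Typically, an observation with a p-value below 0.001 is considered an outlier. In this case, all p-values are greater than 0.001, so none of the rows is flagged as an outlier.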
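The 0.001 rule can also be turned into a logical flag. The sketch below uses a small simulated data frame rather than the article's df; the names toy, d2, p, outlier, and cutoff are hypothetical, and the k - 1 degrees of freedom simply follow the convention used in this article.

```r
# Simulated data with three variables (k = 3)
set.seed(1)
toy <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
k <- ncol(toy)

d2 <- mahalanobis(toy, colMeans(toy), cov(toy))   # squared distances
p  <- pchisq(d2, df = k - 1, lower.tail = FALSE)  # upper-tail p-values

# Flag rows whose p-value falls below the 0.001 threshold
outlier <- p < 0.001

# Equivalent test on the distance scale: compare with the chi-square cutoff
cutoff <- qchisq(0.999, df = k - 1)
identical(outlier, d2 > cutoff)

toy[outlier, ]   # rows flagged as outliers (possibly none)
```

Comparing the distances with qchisq(0.999, df) gives the same flags as comparing the p-values with 0.001, so either form can be used.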