Distance Matrix by GPU in R Programming
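Distance measurement is an important tool in statistical analysis. It quantifies the dissimilarity between sample data for numerical computation. A popular choice of distance metric is the Euclidean distance, which is the square root of the sum of the squared differences between attributes. Specifically, for two data points p and q with n numeric attributes, the Euclidean distance between them is:
$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$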
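As a small worked case of the Euclidean distance described above, the following sketch evaluates it for two made-up points with n = 3 attributes. The values of p and q are illustrative assumptions, not data from this article.

```r
# Two hypothetical data points with three numeric attributes each.
p <- c(1, 2, 3)
q <- c(4, 6, 3)

# Square root of the sum of squared attribute differences:
# sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0)
d <- sqrt(sum((p - q)^2))
print(d)  # 5
```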
The available distance measures are (written for two vectors x and y):
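- Euclidean: the usual distance between the two vectors (2-norm, also known as L2): $\sqrt{\sum_i (x_i - y_i)^2}$
- Maximum: the maximum distance between two components of x and y (supremum norm): $\max_i |x_i - y_i|$
- Manhattan: the absolute distance between the two vectors (1-norm, also known as L1): $\sum_i |x_i - y_i|$
- Canberra: $\sum_i |x_i - y_i| / (|x_i| + |y_i|)$. Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
- Binary (also known as asymmetric binary): the vectors are regarded as binary bits, so non-zero elements are "on" and zero elements are "off". The distance is the proportion of bits in which only one is on, among those in which at least one is on.
- Minkowski: the p-norm, the p-th root of the sum of the p-th powers of the differences of the components: $\left(\sum_i |x_i - y_i|^p\right)^{1/p}$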
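The six definitions above can be written out directly and checked against dist(). This is a minimal sketch, not how dist() is implemented internally. The helper names (euclidean, maximum, and so on) and the sample vectors are assumptions made for illustration. The canberra helper assumes that x and y are never zero at the same position, so no term has to be omitted.

```r
# Hand-written versions of the six measures for two numeric vectors.
euclidean <- function(x, y) sqrt(sum((x - y)^2))
maximum   <- function(x, y) max(abs(x - y))
manhattan <- function(x, y) sum(abs(x - y))
# Assumes no position where both entries are zero (no 0/0 terms).
canberra  <- function(x, y) sum(abs(x - y) / (abs(x) + abs(y)))
binary    <- function(x, y) {
  on_x <- x != 0
  on_y <- y != 0
  # bits where exactly one is on / bits where at least one is on
  sum(xor(on_x, on_y)) / sum(on_x | on_y)
}
minkowski <- function(x, y, p = 2) sum(abs(x - y)^p)^(1 / p)

# Illustrative vectors: non-negative, no shared zeros.
x <- c(0, 2, 3, 4)
y <- c(1, 2, 0, 1)
m <- rbind(x, y)

by_hand <- c(
  euclidean = euclidean(x, y),
  maximum   = maximum(x, y),
  manhattan = manhattan(x, y),
  canberra  = canberra(x, y),
  binary    = binary(x, y),
  minkowski = minkowski(x, y, p = 3)
)
from_dist <- c(
  euclidean = as.numeric(dist(m, method = "euclidean")),
  maximum   = as.numeric(dist(m, method = "maximum")),
  manhattan = as.numeric(dist(m, method = "manhattan")),
  canberra  = as.numeric(dist(m, method = "canberra")),
  binary    = as.numeric(dist(m, method = "binary")),
  minkowski = as.numeric(dist(m, method = "minkowski", p = 3))
)

# TRUE when every hand-written value agrees with dist().
print(all.equal(by_hand, from_dist))
```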
Implementation in R
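The distance matrix is computed in R with the dist() function. dist() takes a data matrix, applies the specified distance measure to every pair of rows, and returns the resulting distance matrix. Note that dist() belongs to the base stats package and runs on the CPU; offloading the computation to a GPU requires an additional package.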
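Because dist() works on rows, the orientation of the input matters. The sketch below uses a small made-up matrix (an assumption for illustration) to show that n rows give n(n-1)/2 pairwise distances, and that transposing with t() yields distances between columns instead.

```r
# Three points in the plane, one per row.
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

d_rows <- dist(x)     # distances between the 3 rows: 3 * 2 / 2 = 3 values
d_cols <- dist(t(x))  # distances between the 2 columns: 1 value

print(length(d_rows))
print(length(d_cols))

# Expand the compact "dist" object into a full symmetric matrix.
print(as.matrix(d_rows))
```

The "dist" object stores only the lower triangle, which is why as.matrix() is needed to look up an arbitrary pair by index.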
Syntax:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
Parameters:
x: a numeric matrix, data frame or "dist" object
method: the distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given.
diag: logical value indicating whether the diagonal of the distance matrix should be printed by print.dist.
upper: logical value indicating whether the upper triangle of the distance matrix should be printed by print.dist.
p: The power of the Minkowski distance
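The parameters above can be tried on a two-row matrix. This is a minimal sketch on made-up values (the matrix and the row names a and b are assumptions). It shows an abbreviated method name, the effect of p on the Minkowski distance, and that diag and upper change only how the result is printed.

```r
x <- rbind(a = c(0, 0), b = c(3, 4))

# "euc" is an unambiguous substring of "euclidean".
d_euc <- dist(x, method = "euc")

# p = 1 reproduces Manhattan, p = 2 reproduces Euclidean.
d_p1 <- dist(x, method = "minkowski", p = 1)
d_p2 <- dist(x, method = "minkowski", p = 2)

print(c(euclidean = as.numeric(d_euc),
        minkowski_p1 = as.numeric(d_p1),
        minkowski_p2 = as.numeric(d_p2)))

# diag and upper affect printing only; the stored distances are the same.
d_full <- dist(x, diag = TRUE, upper = TRUE)
print(d_full)
```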
Example
R
# the number of values drawn by rnorm() should be a multiple of nrow
x <- matrix(rnorm(150), nrow = 5)
dist(x)
dist(x, diag = TRUE)
dist(x, upper = TRUE)
m <- as.matrix(dist(x))
d <- as.dist(m)
stopifnot(d == dist(x))
# showing all the six distance measures
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)
dist(rbind(x, y), method = "binary")
dist(rbind(x, y), method = "canberra")
dist(rbind(x, y), method = "manhattan")
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "maximum")
dist(rbind(x, y), method = "minkowski")
Output:
> dist(x)
         1        2        3        4
2 6.772630
3 7.615303 7.390410
4 6.460424 6.759275 7.773421
5 6.551426 7.688254 7.886380 7.039102
> dist(x, diag = TRUE)
         1        2        3        4        5
1 0.000000
2 6.772630 0.000000
3 7.615303 7.390410 0.000000
4 6.460424 6.759275 7.773421 0.000000
5 6.551426 7.688254 7.886380 7.039102 0.000000
> dist(x, upper = TRUE)
         1        2        3        4        5
1          6.772630 7.615303 6.460424 6.551426
2 6.772630          7.390410 6.759275 7.688254
3 7.615303 7.390410          7.773421 7.886380
4 6.460424 6.759275 7.773421          7.039102
5 6.551426 7.688254 7.886380 7.039102
> dist(rbind(x, y), method = "binary")
    x
y 0.4
> dist(rbind(x, y), method = "canberra")
    x
y 2.4
> dist(rbind(x, y), method = "manhattan")
  x
y 2
> dist(rbind(x, y), method = "euclidean")
         x
y 1.414214
> dist(rbind(x, y), method = "maximum")
  x
y 1
> dist(rbind(x, y), method = "minkowski")
         x
y 1.414214
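The Canberra result of 2.4 may look surprising, since only two positions differ and each contributes 1. The sketch below reproduces it by hand, relying on the behaviour documented for dist(): when a term is omitted, the sum is scaled up in proportion to the number of columns actually used. The variable names are illustrative.

```r
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)

num  <- abs(x - y)
den  <- abs(x) + abs(y)
keep <- den != 0                       # position 2 is 0/0 and is omitted

raw    <- sum(num[keep] / den[keep])   # 1 + 0 + 0 + 1 + 0 = 2
scaled <- raw * length(x) / sum(keep)  # 2 * 6 / 5 = 2.4

print(c(raw = raw, scaled = scaled))
print(as.numeric(dist(rbind(x, y), method = "canberra")))
```

The binary result of 0.4 follows the same pair of vectors: five positions have at least one bit on, and in two of them exactly one bit is on, giving 2/5.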