计算加权成对马氏距离
Calculating weigthed pairwise Mahalanobis distances
我知道 HDMD
包有一个名为 pairwise.mahalanobis
的函数,它应该计算成对的马氏距离。但是,我也想为这个距离引入权重,这个功能是不可行的。因此,我开发了自己的代码。为了测试它是否运行良好,我首先保持简单,即没有权重,并将其结果与 pairwise.mahalanobis
函数的结果进行比较。但是结果不匹配......下面是我使用的函数:
dist.maha <- function (X) {
diff = pair.diff(X) # pairwise difference of rows
V <- cov(X) ## empirical covariance; positive definite
L <- t(chol(V)) ## lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
return(stdX)
}
这是它在玩具数据上的两种选择的实现:
data = as.matrix(c(100, 54, 56, 79, 12))
dist_manuel = dist.maha(data)
# This is to convert dist_manuel from a vector to a distance matrix
ind_1 = vector(length = choose(nrow(data),2))
ind_2 = vector(length = choose(nrow(data),2))
k =1
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
ind_1[k] = i
ind_2[k] = j
k = k + 1
}
}
dist_manuel = cbind(ind_1,ind_2,dist_manuel)
dist_mat = matrix(data = NA, nrow = nrow(data), ncol = nrow(data))
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
dist_mat[i,j] = dist_manuel[which(dist_manuel[,1] == i & dist_manuel[,2] == j),3]
}
}
# This is the HDMD alternative
id = c(1,2,3,4,5)
data = cbind(id,data)
HDMD = pairwise.mahalanobis(as.data.frame(data[,2]), grouping = data[,1])
dist_HDMD = HDMD$distance
# The outputs
dist_HDMD
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 1 4 9 16
#[2,] 1 0 1 4 9
#[3,] 4 1 0 1 4
#[4,] 9 4 1 0 1
#[5,] 16 9 4 1 0
dist_mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA NA NA NA NA
#[2,] 1.4002541 NA NA NA NA
#[3,] 1.3393735 -0.06088061 NA NA NA
#[4,] 0.6392465 -0.76100768 -0.7001271 NA NA
#[5,] 2.6787470 1.27849290 1.3393735 2.039501 NA
pairwise.mahalanobi
s 函数的结果对我来说似乎完全荒谬。对于初学者,它为 data[2]
& data[3]
和 data[2]
& data[1]
分配了 1 的距离,这在查看它们的值时毫无意义。另一方面,我的功能给出了一致的结果。例如,让我们比较一下 data[1]
& data[2]
和 data[1]
& data[3]
.
之间的距离比
(100 - 54) / (100 - 56) = 46/44 = 1.045455
现在,这个比率也适用于我的函数产生的距离。
dist_mat[2,1]/dist_mat[3,1]
#[1] 1.045455
确实如此!这是否意味着 pairwise.mahalanobis
错误时我的函数运行良好? (或者我不知何故使用不正确?)我在 R 方面不是很有经验,所以我不敢自己得出这个结论。如果有比我更有经验的人能证实我的逻辑就好了。
您的 dist.maha
函数有错误。这很明显,因为它计算的一些距离是负数——所以它们不可能是实际距离!幸运的是,这很容易解决,因为您只需要对 stdX
向量求平方即可。
library("HDMD")
library("tidyverse")
# Convert a vector of pairwise distances from to a distance matrix
# (Simplified approach which doesn't use for-loops)
pairwise_dist_to_dist_matrix <- function(dist, n) {
stopifnot(length(dist) == n*(n-1)/2)
mat <- matrix(NA_real_, n, n)
diag(mat) <- 0
mat[lower.tri(mat)] <- dist
mat
}
dist.maha <- function (X) {
diff <- pair.diff(X) # pairwise difference of rows
V <- cov(X) # empirical covariance; positive definite
L <- t(chol(V)) # lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
dist <- stdX * stdX # Don't forget to square!
dist <- rowSums(dist) # And add up the differences in each dimension.
pairwise_dist_to_dist_matrix(dist, nrow(X))
}
# An alternative computation, for an additional check
dist.maha2 <- function(X) {
diff <- pair.diff(X)
V <- cov(X)
Vinv <- solve(V)
dist <- rowSums(diff %*% Vinv * diff)
pairwise_dist_to_dist_matrix(dist, nrow(X))
}
# Slightly more complex data matrix to check if
# functions work in higher dimensions
data <- matrix(c(100, 54, 56, 79, 12, 1, 2, 3, 4, 5), ncol = 2)
dist.maha(data)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 NA NA NA NA
#> [2,] 2.275210 0.0000000 NA NA NA
#> [3,] 1.974017 0.9742819 0.000000 NA NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000 NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146 0
dist.maha2(data)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 NA NA NA NA
#> [2,] 2.275210 0.0000000 NA NA NA
#> [3,] 1.974017 0.9742819 0.000000 NA NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000 NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146 0
看来您也没有正确使用pairwise.mahalanobis
。您必须计算并传递协方差矩阵(cov
参数)。
# This is the HDMD alternative
id = c(1,2,3,4,5)
# Incorrect:
# You have to specify the `cov` argument.
# Otherwise `pairwise.mahalanobis` doesn't compute it correctly
# as each sample is assumed to be in its own group.
pairwise.mahalanobis(data, grouping = id)$distance
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000 4345.2805 3840.7349 759.689 15362.940
#> [2,] 4345.280 0.0000 16.7591 1487.209 3369.708
#> [3,] 3840.735 16.7591 0.0000 1194.197 3840.735
#> [4,] 759.689 1487.2090 1194.1967 0.000 9310.174
#> [5,] 15362.940 3369.7076 3840.7349 9310.174 0.000
# Correct:
# NOTE: Ignore the warning message; there seems to be a small bug in `pairwise.mahalanobis`.
pairwise.mahalanobis(data, grouping = id, cov = cov(data))$distance
#> Warning in if (dim(cov) != c(p, p)) stop("cov matrix not of dim = (p,p)
#> \n"): the condition has length > 1 and only the first element will be used
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 2.2752099 1.9740168 4.759700 7.896067
#> [2,] 2.275210 0.0000000 0.9742819 7.584269 3.621388
#> [3,] 1.974017 0.9742819 0.0000000 3.250906 1.974017
#> [4,] 4.759700 7.5842687 3.2509059 0.000000 5.690146
#> [5,] 7.896067 3.6213875 1.9740168 5.690146 0.000000
由 reprex package (v0.2.1)
于 2019-03-24 创建
我知道 HDMD
包有一个名为 pairwise.mahalanobis
的函数,它应该计算成对的马氏距离。但是,我也想为这个距离引入权重,这个功能是不可行的。因此,我开发了自己的代码。为了测试它是否运行良好,我首先保持简单,即没有权重,并将其结果与 pairwise.mahalanobis
函数的结果进行比较。但是结果不匹配......下面是我使用的函数:
dist.maha <- function (X) {
diff = pair.diff(X) # pairwise difference of rows
V <- cov(X) ## empirical covariance; positive definite
L <- t(chol(V)) ## lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
return(stdX)
}
这是它在玩具数据上的两种选择的实现:
data = as.matrix(c(100, 54, 56, 79, 12))
dist_manuel = dist.maha(data)
# This is to convert dist_manuel from a vector to a distance matrix
ind_1 = vector(length = choose(nrow(data),2))
ind_2 = vector(length = choose(nrow(data),2))
k =1
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
ind_1[k] = i
ind_2[k] = j
k = k + 1
}
}
dist_manuel = cbind(ind_1,ind_2,dist_manuel)
dist_mat = matrix(data = NA, nrow = nrow(data), ncol = nrow(data))
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
dist_mat[i,j] = dist_manuel[which(dist_manuel[,1] == i & dist_manuel[,2] == j),3]
}
}
# This is the HDMD alternative
id = c(1,2,3,4,5)
data = cbind(id,data)
HDMD = pairwise.mahalanobis(as.data.frame(data[,2]), grouping = data[,1])
dist_HDMD = HDMD$distance
# The outputs
dist_HDMD
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 1 4 9 16
#[2,] 1 0 1 4 9
#[3,] 4 1 0 1 4
#[4,] 9 4 1 0 1
#[5,] 16 9 4 1 0
dist_mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA NA NA NA NA
#[2,] 1.4002541 NA NA NA NA
#[3,] 1.3393735 -0.06088061 NA NA NA
#[4,] 0.6392465 -0.76100768 -0.7001271 NA NA
#[5,] 2.6787470 1.27849290 1.3393735 2.039501 NA
pairwise.mahalanobi
s 函数的结果对我来说似乎完全荒谬。对于初学者,它为 data[2]
& data[3]
和 data[2]
& data[1]
分配了 1 的距离,这在查看它们的值时毫无意义。另一方面,我的功能给出了一致的结果。例如,让我们比较一下 data[1]
& data[2]
和 data[1]
& data[3]
.
(100 - 54) / (100 - 56) = 46/44 = 1.045455
现在,这个比率也适用于我的函数产生的距离。
dist_mat[2,1]/dist_mat[3,1]
#[1] 1.045455
确实如此!这是否意味着 pairwise.mahalanobis
错误时我的函数运行良好? (或者我不知何故使用不正确?)我在 R 方面不是很有经验,所以我不敢自己得出这个结论。如果有比我更有经验的人能证实我的逻辑就好了。
您的 dist.maha
函数有错误。这很明显,因为它计算的一些距离是负数——所以它们不可能是实际距离!幸运的是,这很容易解决,因为您只需要对 stdX
向量求平方即可。
library("HDMD")
library("tidyverse")
# Convert a vector of pairwise distances from to a distance matrix
# (Simplified approach which doesn't use for-loops)
pairwise_dist_to_dist_matrix <- function(dist, n) {
stopifnot(length(dist) == n*(n-1)/2)
mat <- matrix(NA_real_, n, n)
diag(mat) <- 0
mat[lower.tri(mat)] <- dist
mat
}
dist.maha <- function (X) {
diff <- pair.diff(X) # pairwise difference of rows
V <- cov(X) # empirical covariance; positive definite
L <- t(chol(V)) # lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
dist <- stdX * stdX # Don't forget to square!
dist <- rowSums(dist) # And add up the differences in each dimension.
pairwise_dist_to_dist_matrix(dist, nrow(X))
}
# An alternative computation, for an additional check
dist.maha2 <- function(X) {
diff <- pair.diff(X)
V <- cov(X)
Vinv <- solve(V)
dist <- rowSums(diff %*% Vinv * diff)
pairwise_dist_to_dist_matrix(dist, nrow(X))
}
# Slightly more complex data matrix to check if
# functions work in higher dimensions
data <- matrix(c(100, 54, 56, 79, 12, 1, 2, 3, 4, 5), ncol = 2)
dist.maha(data)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 NA NA NA NA
#> [2,] 2.275210 0.0000000 NA NA NA
#> [3,] 1.974017 0.9742819 0.000000 NA NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000 NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146 0
dist.maha2(data)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 NA NA NA NA
#> [2,] 2.275210 0.0000000 NA NA NA
#> [3,] 1.974017 0.9742819 0.000000 NA NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000 NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146 0
看来您也没有正确使用pairwise.mahalanobis
。您必须计算并传递协方差矩阵(cov
参数)。
# This is the HDMD alternative
id = c(1,2,3,4,5)
# Incorrect:
# You have to specify the `cov` argument.
# Otherwise `pairwise.mahalanobis` doesn't compute it correctly
# as each sample is assumed to be in its own group.
pairwise.mahalanobis(data, grouping = id)$distance
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000 4345.2805 3840.7349 759.689 15362.940
#> [2,] 4345.280 0.0000 16.7591 1487.209 3369.708
#> [3,] 3840.735 16.7591 0.0000 1194.197 3840.735
#> [4,] 759.689 1487.2090 1194.1967 0.000 9310.174
#> [5,] 15362.940 3369.7076 3840.7349 9310.174 0.000
# Correct:
# NOTE: Ignore the warning message; there seems to be a small bug in `pairwise.mahalanobis`.
pairwise.mahalanobis(data, grouping = id, cov = cov(data))$distance
#> Warning in if (dim(cov) != c(p, p)) stop("cov matrix not of dim = (p,p)
#> \n"): the condition has length > 1 and only the first element will be used
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.000000 2.2752099 1.9740168 4.759700 7.896067
#> [2,] 2.275210 0.0000000 0.9742819 7.584269 3.621388
#> [3,] 1.974017 0.9742819 0.0000000 3.250906 1.974017
#> [4,] 4.759700 7.5842687 3.2509059 0.000000 5.690146
#> [5,] 7.896067 3.6213875 1.9740168 5.690146 0.000000
由 reprex package (v0.2.1)
于 2019-03-24 创建