Mahalanobis distance with multiple observations per group
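I want to compute the Mahalanobis distance between groups of species, where:
- i) there are more than two groups (more than two species).
- ii) there are multiple variables (characteristics of the species) to take into account.
- iii) each group has multiple observations (in the data frame, this means each species has more than one row).
I am trying to understand how to run the mahalanobis function in R for this case. This question is similar to: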
Mahalanobis distance on R for more than 2 groups
But there, only one variable was used. What should be done with more than one variable?
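To state the target quantity explicitly: this is the standard textbook definition of the distance between two group means, written under the assumption that all groups share one covariance matrix. The symbols $\mu_i$, $\mu_j$, $\Sigma$ and $\sigma$ are notation introduced here for illustration, not something taken from the linked question.

```latex
D_{ij}^{2} = (\mu_i - \mu_j)^{\top} \, \Sigma^{-1} \, (\mu_i - \mu_j),
\qquad
D_{ij} = \sqrt{D_{ij}^{2}}
% With a single variable, \Sigma reduces to the scalar variance \sigma^2,
% and the distance becomes:
% D_{ij} = \frac{|\mu_i - \mu_j|}{\sigma}
```

With several variables, $\mu_i$ is the vector of per-variable means for species $i$ and $\Sigma$ is a $p \times p$ matrix, which is the case I am asking about.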
Below is an example that I believe reproduces the structure of my real data.
Sp. X1 X2 X3
A 0.7 11 215
B 0.8 7 214
B 0.8 6.5 187
C 0.3 4 456
D 0.4 3 111
A 0.1 7 205
A 0.2 7 196
C 0.1 9.3 77
D 0.6 8 135
D 0.8 4 167
B 0.4 6 228
C 0.1 5 214
A 0.4 7 156
C 0.5 2 344
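Sp. = species; X1, X2 and X3 are the observed variables.
In the real dataset there are more than 50 species, and the number of observations varies (from 100 to 1000 rows per species).
This is what I have so far, using the pairwise.mahalanobis function from the HDMD package: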
#data
a = structure(list(Sp = structure(c(1L, 2L, 2L, 3L, 4L, 1L, 1L, 3L,4L, 4L, 2L, 3L, 1L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
X1 = c(0.7, 0.8, 0.8, 0.3, 0.4, 0.1, 0.2, 0.1, 0.6, 0.8,0.4, 0.1, 0.4, 0.5),
X2 = c(11, 7, 6.5, 4, 3, 7, 7, 9.3,8, 4, 6, 5, 7, 2),
X3 = c(215L, 214L, 187L, 456L, 111L, 205L,196L, 77L, 135L, 167L, 228L, 214L, 156L, 344L)),
.Names = c("Sp","X1", "X2", "X3"),
row.names = c(NA, -14L),
class = "data.frame")
library(HDMD) #pairwise.mahalanobis function
library(cluster) #agnes function
group = matrix(a$Sp) #what is being compared
group = t(group[,1]) #prepare for pairwise.mahalanobis function
variables = c("X1","X2","X3") #variables (what is being used for comparison)
variables = as.matrix(a[,variables]) #prepare for pairwise.mahalanobis function
mahala_sq = pairwise.mahalanobis(x=variables, grouping=group) #get squared mahalanobis distances (see mahala_sq$distance).
names = rownames(mahala_sq$means) #capture labels
mahala = sqrt(mahala_sq$distance) #mahalanobis distance
rownames(mahala) = names #set rownames in the dissimilarity matrix
colnames(mahala) = names #set colnames in the dissimilarity matrix
mahala #this is the mahalanobis dissimilarity matrix
A B C D
A 0.00000 17.78689 86.83294 62.65437
B 17.78689 0.00000 69.07937 80.31577
C 86.83294 69.07937 0.00000 149.36579
D 62.65437 80.31577 149.36579 0.00000
#This is how I used the dissimilarity matrix to find clusters.
cluster = agnes(mahala,diss=TRUE,keep.diss=FALSE,method="complete") #hierarchical clustering
plot(cluster,which.plots=2) #plot dendrogram
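As a cross-check on the HDMD result, here is a minimal base-R sketch that builds the same kind of species-by-species matrix. It assumes a single pooled within-group covariance matrix shared by all species, which may not be what pairwise.mahalanobis does internally, so the numbers are not expected to match the output above. The names `centroids`, `pooled` and `mahala_pooled` are made up for this illustration.

```r
# Example data from the question (Sp = species; X1..X3 = observed variables).
a <- data.frame(
  Sp = factor(c("A","B","B","C","D","A","A","C","D","D","B","C","A","C")),
  X1 = c(0.7, 0.8, 0.8, 0.3, 0.4, 0.1, 0.2, 0.1, 0.6, 0.8, 0.4, 0.1, 0.4, 0.5),
  X2 = c(11, 7, 6.5, 4, 3, 7, 7, 9.3, 8, 4, 6, 5, 7, 2),
  X3 = c(215, 214, 187, 456, 111, 205, 196, 77, 135, 167, 228, 214, 156, 344)
)

x   <- as.matrix(a[, c("X1", "X2", "X3")])
grp <- a$Sp

# Group centroids: one row per species, one column per variable.
# rowsum() and table() both order the groups by factor level.
centroids <- rowsum(x, grp) / as.vector(table(grp))

# ASSUMPTION: one pooled within-group covariance for all species.
# Center each row on its own species centroid, then pool the cross-products.
centered <- x - centroids[as.character(grp), ]
pooled   <- crossprod(centered) / (nrow(x) - nlevels(grp))

# Squared Mahalanobis distance between every pair of centroids.
k  <- nrow(centroids)
d2 <- matrix(0, k, k,
             dimnames = list(rownames(centroids), rownames(centroids)))
for (i in seq_len(k)) {
  # stats::mahalanobis() returns SQUARED distances of each row to `center`.
  d2[i, ] <- mahalanobis(centroids, center = centroids[i, ], cov = pooled)
}

# Take the square root to get distances on the original scale.
mahala_pooled <- sqrt(d2)
mahala_pooled
```

The resulting matrix can go through the same clustering step as before, e.g. agnes(mahala_pooled, diss=TRUE, method="complete"). Pooling is only reasonable if the species have broadly similar covariance structure; with 100 to 1000 rows per species in the real data, that assumption could be checked before relying on it.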