How to find the best resemblance between one row and the rest of a data frame in R?

Question

How can I find the rows of a data frame that most closely resemble one specific row?

Let me try to explain what I mean. Take this data frame:

df <- structure(list(person = 1:5, var1 = c(1L, 5L, 2L, 2L, 5L), var2 = c(4L, 
4L, 3L, 2L, 2L), var3 = c(5L, 4L, 4L, 3L, 1L)), .Names = c("person", 
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA, 
-5L))

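How can I find the best resemblance between person 1 (row 1) and the remaining rows (persons) in this data frame? The output should keep person 1 in row 1, with the remaining rows ordered from most to least similar. The similarity measure I would like to use is cosine or Pearson. I tried to solve this with functions from the arules package, but they did not fit my needs.

Does anyone have an idea?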

Answer 1

You can try cosine from the lsa package:

library('lsa') 
cosine(t(df[-1]))
#          [,1]      [,2]      [,3]      [,4]      [,5]
#[1,] 1.0000000 0.8379571 0.9742160 0.9356015 0.5070926
#[2,] 0.8379571 1.0000000 0.9346460 0.9637388 0.8947540
#[3,] 0.9742160 0.9346460 1.0000000 0.9908302 0.6780635
#[4,] 0.9356015 0.9637388 0.9908302 1.0000000 0.7527727
#[5,] 0.5070926 0.8947540 0.6780635 0.7527727 1.0000000

You give cosine a matrix in which each column represents one person (which is why I use t), and it computes the cosine similarity between every pair of them.
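The similarity matrix alone does not yet reorder the rows. The following is a minimal sketch of that last step, assuming the df from the question. To stay self-contained it builds the cosine matrix in base R instead of calling lsa, and the names m, norms, sim, ord, and result are illustrative only:

```r
# Data frame from the question, rebuilt here so the sketch runs on its own
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

m <- as.matrix(df[-1])                      # one row per person, person column dropped
norms <- sqrt(rowSums(m^2))                 # Euclidean length of each row
sim <- tcrossprod(m) / outer(norms, norms)  # all pairwise cosine similarities

# Rank persons 2..n by their similarity to person 1, most similar first;
# the + 1 maps positions in sim[1, -1] back to row numbers in df
ord <- order(sim[1, -1], decreasing = TRUE) + 1
result <- df[c(1, ord), ]
result
```

If the matrix returned by cosine(t(df[-1])) is already available, it should be usable in place of sim here, since only its first row is needed.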

Answer 2

Another idea is to define the cosine function manually and apply it to your data frame, i.e.

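f1 <- function(x, y){
  # cosine similarity of two numeric vectors; drop() turns the 1x1 matrix into a scalar
  drop(crossprod(x, y) / sqrt(crossprod(x) * crossprod(y)))
}

# similarity of person 1 to each of persons 2..n (the person column is excluded)
sims <- sapply(2:nrow(df), function(i) f1(unlist(df[1, -1]), unlist(df[i, -1])))

# keep row 1 first, then the remaining rows from most to least similar
df[c(1, order(sims, decreasing = TRUE) + 1), ]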

which gives,

   person var1 var2 var3
1      1    1    4    5
3      3    2    3    4
4      4    2    2    3
2      2    5    4    4
5      5    5    2    1
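The question also mentions Pearson correlation as an alternative measure. The following is a rough sketch of the same reordering with base R's cor(), assuming the df from the question; the names pearson, ord, and result_pearson are illustrative only:

```r
# Data frame from the question, rebuilt here so the sketch runs on its own
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# cor() correlates columns, so transpose to put one person in each column;
# the default method is Pearson
pearson <- cor(t(df[-1]))

# Rank persons 2..n by their correlation with person 1, highest first
ord <- order(pearson[1, -1], decreasing = TRUE) + 1
result_pearson <- df[c(1, ord), ]
result_pearson
```

For this data the Pearson ordering happens to match the cosine ordering, but that is not guaranteed in general. With only three variables per person the correlations are coarse, and a person whose values are all identical has zero variance, for which cor() returns NA.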

r

similarity

cosine-similarity