Is it necessary to convert data to a binary set to calculate similarity (Jaccard index)?

I need to calculate the Jaccard similarity between the groups in column a of the following data frame:

df = data.frame(
    a=c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"), 
    b=c("100", "101", "111", "25841", "111", "101", "106", "101", "108", "100", "30256", "108", "112"))

Is it necessary to convert the data to a binary set like the one below? If so, how is that done?

  100 101 111 25841 106 108 30256 112
1   1   1   1     1   0   0     0   0
2   0   1   1     0   1   0     0   0
3   0   1   0     0   0   1     0   0
4   1   0   0     0   0   1     1   1

And then use jaccard <- vegdist(df, method = "jaccard")?

The Jaccard index can be derived from a binary table, but the conversion is not required. See the Wikipedia article on the Jaccard index.
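For the binary-table route asked about in the question, here is a minimal sketch in base R. It assumes that a presence/absence matrix built with table() is acceptable, and it uses dist(method = "binary"), whose definition (the proportion of positions where only one row is "on" among those where at least one is "on") coincides with the Jaccard distance. The names m, jd_binary and ji_binary are illustrative and do not come from the original post.

```r
df <- data.frame(
  a = c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"),
  b = c("100", "101", "111", "25841", "111", "101", "106", "101", "108",
        "100", "30256", "108", "112"),
  stringsAsFactors = FALSE)

# Rows are the groups in a, columns are the distinct values of b
m <- unclass(table(df$a, df$b))
m[m > 0] <- 1  # collapse counts to presence/absence in case a pair repeats

# Base R's "binary" distance equals the Jaccard distance on 0/1 rows
jd_binary <- dist(m, method = "binary")

# Similarity matrix: 1 - distance (the diagonal is 1)
ji_binary <- 1 - as.matrix(jd_binary)
```

If vegan is preferred, the same matrix m can presumably be passed as vegdist(m, method = "jaccard", binary = TRUE). Passing the long two-column df directly would not give the intended result, since vegdist expects one row per group and one column per item.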

Here I show another way to get the Jaccard index, working directly on the sets of b values without building a binary table.

# Data
df = data.frame(
  a=c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"), 
  b=c("100", "101", "111", "25841", "111", "101", "106", "101", "108", "100", "30256", "108", "112"),
  stringsAsFactors = FALSE)

library('data.table')
setDT(df)

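# Jaccard index: size of the intersection divided by size of the union
jaccard_index <- function(x, y)
{
  x_int <- intersect(x, y)  # elements present in both x and y
  x_union <- union(x, y)    # elements present in x or y
  return(length(x_int) / length(x_union))
}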

ji <- combn(unique(df$a), 2, FUN = function(z){
  x <- df[ a %in% z[1], b]
  y <- df[ a %in% z[2], b]
  jaccard_index(x,y)
  })
ji <- setNames( ji, combn(unique(df$a), 2, FUN = paste0, collapse = ""))
ji
#        12        13        14        23        24        34 
# 0.4000000 0.2000000 0.1428571 0.2500000 0.0000000 0.2000000 

# jaccard distance
jd <- 1- ji
jd
#        12        13        14        23        24        34 
# 0.6000000 0.8000000 0.8571429 0.7500000 1.0000000 0.8000000
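The same pairwise computation can also be sketched without data.table, in case a full similarity matrix is more convenient than a named vector. This is only an illustration under that assumption: split() builds one set of b values per group, and nested sapply() calls fill the matrix. The names sets and ji_mat are hypothetical.

```r
df <- data.frame(
  a = c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"),
  b = c("100", "101", "111", "25841", "111", "101", "106", "101", "108",
        "100", "30256", "108", "112"),
  stringsAsFactors = FALSE)

jaccard_index <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# One character vector of b values per group in a
sets <- split(df$b, df$a)

# Symmetric matrix of pairwise Jaccard indices; rows and columns are groups
ji_mat <- sapply(sets, function(y) sapply(sets, function(x) jaccard_index(x, y)))

# Corresponding Jaccard distances
jd_mat <- 1 - ji_mat
```

Here ji_mat["1", "2"] corresponds to the entry labelled 12 in the output above.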

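Below, the same function is applied to the test data from the example linked here, which also gives the expected output as a reference point: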

# test data
test <- data.frame( a = c(rep("A",5), rep("B", 7)),
                    b = c(0,1,2,5,6,0,2,3,4,5,7,9),
                    stringsAsFactors = FALSE)
setDT(test)

# jaccard index
ji_test <- combn(unique(test$a), 2, FUN = function(z){
  x <- test[ a %in% z[1], b]
  y <- test[ a %in% z[2], b]
  jaccard_index(x,y)
})
ji_test <- setNames( ji_test, combn(unique(test$a), 2, FUN = paste0, collapse = ""))
ji_test
# AB 
# 0.3333333 

# jaccard distance
jd_test <- 1- ji_test
jd_test
# AB 
# 0.6666667
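The test result can be checked by hand from the set definition of the Jaccard index. The symbols A, B, J and d_J below are notation chosen for this sketch, with A and B being the b values of groups "A" and "B" in the test data.

```latex
\[
J(A,B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A,B) = 1 - J(A,B)
\]
\[
A = \{0,1,2,5,6\}, \qquad B = \{0,2,3,4,5,7,9\}
\]
\[
A \cap B = \{0,2,5\}, \qquad A \cup B = \{0,1,2,3,4,5,6,7,9\}
\]
\[
J(A,B) = \frac{3}{9} \approx 0.3333, \qquad d_J(A,B) = \frac{6}{9} \approx 0.6667
\]
```

Both values agree with the ji_test and jd_test output above.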