Compute similarity between elements from a csv file in R

Question

First of all, I am new to R.

My data frame is:

df<-

column-1  column-2 column-3 column-4

vf34       bn56     qw34    mn569
vf34       cv34             mn569
           bn56     qw34    asder45
nght       cv34             asder45
vf34       cv34             mn569

Now I want to compute the similarity matrix as:

Output1:
          vf34  nght  bn56  cv34  qw34   mn569  asder45     
vf34      0     0     1     2     1      3      0
nght      0     0     0     1     0      0      1
bn56      1     0     0     0     2      1      1
cv34      2     1     0     0     0      2      1
qw34      1     0     2     0     0      1      1
mn569     3     0     1     2     1      0      0
asder45   0     1     1     1     1      0      0

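So, basically, it should find all possible pairs in the data frame (or csv file) and build a matrix holding the number of times each pair occurs.

For example, the first row, sixth column is 3. That means the combination of vf34 and mn569 occurs 3 times in the whole data.

Blank values in the data mean that the original data itself is missing those values.

I can do this in Python using CountVectorizer and then multiplying the resulting matrix by its transpose. But I am new to R. Can someone help me with this?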

The Output2 that I need is:

1  1 3 2 1 0
and so on for all 5 rows.

Here 1; 1; 3; 2; 1; 0 are the number of times the combinations
(vf34 and bn56); (vf34 and qw34); (vf34 and mn569); (bn56 and qw34); (bn56 and mn569);
(qw34 and mn569) have occurred.
These values can be obtained from Output1 given above.

I need these values for all five rows. How can I do this?

Answer 1

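Here is one way to get the expected result. The workflow is:

Get the unique elements of the "dataset" (unique(unlist(df)))
Remove the empty strings ('')
Create the pairwise combinations of the column indices (combn(1:..)) ("indx")
Split "indx" by its columns (split(indx, col(indx)))
Subset "df" by each pair of columns (df[x])
Remove the rows in which either column is an empty string
Change the "character" columns to "factor" class with levels set to "Un1"
Get the frequencies with table and sum the list elements (+) with Reduce.

Add the result (res) to its transpose so that the elements above and below the diagonal are the same: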

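# unique values in the data, without the blanks
Un <- unique(unlist(df))
Un1 <- Un[Un!='']
# every pair of column indices, one pair per column of indx
indx <- combn(1:ncol(df),2)
res <- Reduce(`+`,lapply(split(indx, col(indx)), function(x) {
            x1 <- df[x]
            # keep only the rows where both columns are non-blank
            x2 <- x1[!(x1[,1]==''|x1[,2]==''),]
            # shared factor levels so every table has the same dimensions
            x2[] <- lapply(x2, factor, levels=Un1)
            table(x2)}))

# add the transpose to make the matrix symmetric
res1 <- res+t(res)
res1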
#           column.2
#column.1  vf34 nght bn56 cv34 qw34 mn569 asder45
# vf34       0    0    1    2    1     3       0
# nght       0    0    0    1    0     0       1
# bn56       1    0    0    0    2     1       1
# cv34       2    1    0    0    0     2       1
# qw34       1    0    2    0    0     1       1
# mn569      3    0    1    2    1     0       0
# asder45    0    1    1    1    1     0       0
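
The approach mentioned in the question (an indicator matrix multiplied by its transpose) can also be sketched in base R. This is only an illustration, not part of the method above: the names m, keep, vals, inc and co are made up for the example, and it assumes blanks are stored as empty strings, as in the "df" used here.

```r
df <- data.frame(
  column.1 = c("vf34", "vf34", "", "nght", "vf34"),
  column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"),
  column.3 = c("qw34", "", "qw34", "", ""),
  column.4 = c("mn569", "mn569", "asder45", "asder45", "mn569"),
  stringsAsFactors = FALSE
)

m    <- as.matrix(df)
keep <- m != ""                      # positions of the non-blank cells
vals <- unique(m[keep])              # values in order of first appearance

# row-by-value indicator matrix: 1 if the value occurs in that row
inc <- table(row = row(m)[keep], value = factor(m[keep], levels = vals))
inc <- (unclass(inc) > 0) * 1

# t(inc) %*% inc counts the rows in which two values occur together
co <- crossprod(inc)
diag(co) <- 0                        # a value paired with itself is not counted
co
```

For this data, co should hold the same counts as res1. Unlike the column-pair loop, it would also count two values that share a row while sitting in any two columns, which makes no difference here.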

Update

Regarding "output2", it is not very clear, as the values do not match your expected result (maybe a typo?)

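lapply(seq_len(nrow(df)), function(i) {
    x1 <- unlist(df[i,])          # values of row i
    x2 <- x1[x1!='']              # drop the blanks
    i1 <- combn(x2,2)             # all pairs within the row
    diag(res1[i1[1,], i1[2,]])})  # count of each pair, looked up in res1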
#[[1]]
#[1] 1 1 3 2 1 1

#[[2]]
#[1] 2 3 2

#[[3]]
#[1] 2 1 1

#[[4]]
#[1] 1 1 1

#[[5]]
#[1] 2 3 2
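
One caveat with the diag() lookup: if a row had exactly two non-blank values, the subset would be a single number, and diag() of a single number builds an identity matrix instead of returning the count. A possible variant, sketched here under the assumption that the count matrix has the values as dimnames, indexes it with a two-column character matrix. The names m, co and out2 are made up for the example, and co is rebuilt with crossprod only to keep the snippet self-contained.

```r
df <- data.frame(
  column.1 = c("vf34", "vf34", "", "nght", "vf34"),
  column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"),
  column.3 = c("qw34", "", "qw34", "", ""),
  column.4 = c("mn569", "mn569", "asder45", "asder45", "mn569"),
  stringsAsFactors = FALSE
)

# co-occurrence counts (stand-in for res1)
m    <- as.matrix(df)
keep <- m != ""
inc  <- (unclass(table(row(m)[keep], m[keep])) > 0) * 1
co   <- crossprod(inc)
diag(co) <- 0

out2 <- lapply(seq_len(nrow(m)), function(i) {
  x <- m[i, m[i, ] != ""]            # non-blank values of row i
  if (length(x) < 2) return(numeric(0))
  p <- combn(x, 2)                   # all pairs within the row
  # one (row name, column name) lookup per pair
  unname(co[cbind(p[1, ], p[2, ])])
})
out2
```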

Data

df <- structure(list(column.1 = c("vf34", "vf34", "", "nght", "vf34"
), column.2 = c("bn56", "cv34", "bn56", "cv34", "cv34"), column.3 = c("qw34", 
"", "qw34", "", ""), column.4 = c("mn569", "mn569", "asder45", 
"asder45", "mn569")), .Names = c("column.1", "column.2", "column.3", 
"column.4"), class = "data.frame", row.names = c(NA, -5L))
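
If the data comes from a csv file rather than from the structure above, something like the following might produce the same "df". This is a sketch under assumptions: the comma-separated layout in csv_text is a guess at what the file looks like, and read.csv(text = ...) is used only so the example runs without a file on disk.

```r
# hypothetical file contents; the real file layout is not shown in the question
csv_text <- "column-1,column-2,column-3,column-4
vf34,bn56,qw34,mn569
vf34,cv34,,mn569
,bn56,qw34,asder45
nght,cv34,,asder45
vf34,cv34,,mn569"

# blank fields in character columns are read as '' (not NA),
# and check.names turns 'column-1' into 'column.1'
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(df)
```

With a real file, the text argument would be replaced by the file path.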
