Efficient way to store lists within a dataframe

Question

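I need to compute pairwise intersections of lists, close to 40k of them. Specifically, I would like to know whether I can store the vector id in column 1 and its list of values in column 2. I should then be able to process column 2, i.e. find the overlap/intersection between any two rows.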

column 1  column 2
idA       1,2,5,9,10
idB       5,9,25
idC       2,25,67

I want to be able to get the pairwise intersection values, and this should also work if the values in column 2 are not sorted.
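For illustration, the id/values layout described above can be held in a data.frame list-column. This is only a minimal sketch: the names `df`, `values`, and `pair_intersections` are made up for the example, and the values are deliberately left unsorted to show that base R's `intersect()` does not need sorted input.

```r
# Hypothetical example data mirroring the table above; values are unsorted on purpose.
df <- data.frame(id = c("idA", "idB", "idC"), stringsAsFactors = FALSE)
df$values <- list(c(10, 1, 9, 2, 5), c(25, 5, 9), c(2, 25, 67))

# All unordered pairs of row indices, so each pair is visited once.
pairs <- combn(nrow(df), 2)

# Intersect the two value vectors of every pair.
pair_intersections <- lapply(seq_len(ncol(pairs)), function(k) {
  intersect(df$values[[pairs[1, k]]], df$values[[pairs[2, k]]])
})
names(pair_intersections) <- paste(df$id[pairs[1, ]], df$id[pairs[2, ]], sep = "_")
```

`pair_intersections[["idA_idB"]]` then holds the values shared by idA and idB. Whether a list-column is fast enough for roughly 40k ids is a separate question; this only shows that the layout is possible.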

If I stay with R, what is the best data structure I can use? My data originally looks like this:

column1 1 2 3 9 10 25 67 5
idA     1 1 0 1  1  0  0 1
idB     0 0 0 1  0  1  0 1
idC     0 1 0 0  0  1  1 0

Edited per the suggestions below to include clearer content.
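The 0/1 layout and the id/values layout carry the same information, so one can be derived from the other. A minimal sketch of going from the indicator table to a named list of values, assuming the entries are strictly 0/1 and the column names are the numeric values (`m` and `values_by_id` are illustrative names):

```r
# Indicator matrix from the question, rebuilt here so the sketch is self-contained.
m <- matrix(
  c(1, 1, 0, 1, 1, 0, 0, 1,
    0, 0, 0, 1, 0, 1, 0, 1,
    0, 1, 0, 0, 0, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("idA", "idB", "idC"),
                  c("1", "2", "3", "9", "10", "25", "67", "5"))
)

# For each row, keep the column names whose entry is 1 and convert them to numbers.
values_by_id <- lapply(seq_len(nrow(m)), function(i) {
  as.numeric(colnames(m)[m[i, ] == 1])
})
names(values_by_id) <- rownames(m)
```

The resulting list follows the column order of the table, so the values are not sorted.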

Answer 1

I would keep the data in a logical matrix:

DF <- read.table(text = "column1 1 2 3 9 10 25 67 5
idA     1 1 0 1  1  0  0 1
idB     0 0 0 1  0  1  0 1
idC     0 1 0 0  0  1  1 0", header = TRUE, check.names = FALSE)

#turn into logical matrix
m <- as.matrix(DF[-1])
rownames(m) <- DF[[1]]
mode(m) <- "logical"

#if you can, create your data as a sparse matrix to save memory
#if you already have a dense data matrix, keep it that way
library(Matrix)
M <- as(m, "lMatrix")

#calculate intersections
#does each comparison twice (once as (i, j) and once as (j, i))
intersections <- simplify2array(
  lapply(seq_len(nrow(M)), function(i)
    #a value is shared when it is TRUE in both rows
    lapply(seq_len(nrow(M)), function(j) colnames(M)[M[i, ] & M[j, ]])
  )
)

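This double loop could be optimized. I would do this in Rcpp and create a long-format data.frame instead of a matrix of lists. I would also do each comparison only once (e.g., only compare the upper triangle).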

colnames(intersections) <- rownames(intersections) <- rownames(M)
#    idA         idB         idC        
#idA Character,5 Character,2 "2"        
#idB Character,2 Character,3 "25"       
#idC "2"         "25"        Character,3

intersections["idA", "idB"]
#[[1]]
#[1] "9" "5"
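The "compare each pair only once, long-format result" idea mentioned above can be sketched in plain R rather than Rcpp. This is an assumption-laden illustration, not the answer's implementation: `pairs`, `long`, and the column names `id1`, `id2`, `value` are invented for the example, and the matrix is rebuilt as a base logical matrix so the block runs on its own.

```r
# Same example data as in the answer, as a base logical matrix.
m <- matrix(
  c(1, 1, 0, 1, 1, 0, 0, 1,
    0, 0, 0, 1, 0, 1, 0, 1,
    0, 1, 0, 0, 0, 1, 1, 0) == 1,
  nrow = 3, byrow = TRUE,
  dimnames = list(c("idA", "idB", "idC"),
                  c("1", "2", "3", "9", "10", "25", "67", "5"))
)

# Each unordered pair of rows once (the upper triangle, without the diagonal).
pairs <- combn(nrow(m), 2)

# One small data.frame per pair, stacked into long format.
long <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(k) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  shared <- colnames(m)[m[i, ] & m[j, ]]
  data.frame(id1 = rep(rownames(m)[i], length(shared)),
             id2 = rep(rownames(m)[j], length(shared)),
             value = shared,
             stringsAsFactors = FALSE)
}))
```

Each row of `long` is one shared value for one pair of ids, and pairs with nothing in common simply contribute no rows. With roughly 40k ids there are about 8e8 pairs, so materialising all of them with `combn()` and `rbind()` will not scale; that is where the Rcpp route, or processing the pairs in chunks, would come in.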
