从非常大的对列表中提取单链接簇

Extract single linkage clusters from very large pairs list

我有一个非常大的配对列表,我需要将其分解为单个链接社区。到目前为止,我已经能够完全在 R 中完成这项工作了。但我需要为整个列表可能太大而无法保存在内存中或 igraph 的 R 实现无法处理的情况做好准备。此任务的一个非常简单的版本如下所示:

library(igraph)
df <- data.frame("p1" = c("a", "a", "d", "d"),
                 "p2" = c("b", "c", "e", "f"),
                 "val" = c(0.5, 0.75, 0.25, 0.35))
g <- graph_from_data_frame(d = df,
                           directed = FALSE)

sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g,
                                          vids = x),
             USE.NAMES = FALSE,
             simplify = FALSE)

如果 df 非常大 - 在数亿到数百亿行的规模上,有没有办法让我提取 sg 的各个位置而无需构建g 完整吗?对我来说,将 df 的表示作为压缩的 txt 文件或 sqlite 数据库存储在 R 之外相对容易。

解决 igraph 的 R 实现问题(假设数据集仍可保存在 RAM 中,否则请参阅@Paul Brodersen 的回答):

下面的解决方案通过指定图形的一个元素然后遍历所有连接直到找不到更多边来工作。因此它创建子图而不构建整个图。与递归函数相比,它看起来有点老套,但扩展性更好。

library(igraph)    
reduce_graph <- function(df, element) {
        stop = F
        elements_to_inspect <- element
        rows_graph <-0
        while(stop ==F) {
            graph_parts <- df[df$p1 %in% elements_to_inspect | 
                                  df$p2 %in% elements_to_inspect,]
            elements_to_inspect <- unique(c(unique(graph_parts$p1), 
                                            unique(graph_parts$p2)))
            if(dim(graph_parts)[1] == rows_graph) {
                stop <-TRUE
            } else {
                rows_graph <- dim(graph_parts)[1]
            }
        }
        return(graph_parts)
    }

df <- data.frame("p1" = c("a", "a", "d", "d","o"),
                 "p2" = c("b", "c", "e", "f","u"),
                 "val" = c(100, 0.75, 0.25, 0.35,1))

small_graph <- reduce_graph(df, "f")
g <- graph_from_data_frame(d = small_graph,
                           directed = FALSE)

sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g,
                                          vids = x),
             USE.NAMES = FALSE,
             simplify = FALSE)

可以在更大的数据集上测试速度。

##larger dataset with lots of sparse graphs.
set.seed(100)
p1 <- as.character(sample(1:10000000, 1000000, replace=T))
p2 <- as.character(sample(1:10000000, 1000000, replace=T))
val <- rep(1, 1000000)
df <- data.frame("p1" = p1,
                 "p2" = p2,
                 "val" = val)

small_graph <- reduce_graph(df, "9420672") #has 3 pairwise connections
g <- graph_from_data_frame(d = small_graph,
                           directed = FALSE)

sg <- groups(components(g))
sg <- sapply(sg,
             function(x) induced_subgraph(graph = g,
                                          vids = x),
             USE.NAMES = FALSE,
             simplify = FALSE)

构建组和子图需要一秒钟,而在我的机器上构建整个图需要几分钟。这当然取决于图的稀疏连接程度。