比较顶点查找 table 与边列表并将顶点标签分配给匹配的任何边的有效方法

Efficient way to compare vertex lookup table with edge list and assign vertex label to any edge that matches

我有两个数据框。首先,包含顶点名称列表的查找 table:

lookup <- data.frame(Name=c("Bob","Jane"))

然后我有一个如下所示的边缘列表:

edges <- data.frame(vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob"
                              ,"Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
                  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
                  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob"
                              ,"Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"))

对于“查找”中的每个唯一顶点 table,我想遍历“边”table 并标记每个 edgeID,其中 lookup$Name 在顶点中。

我可以使用以下脚本来做到这一点:

library(igraph)

g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)
   group vertex1 edgeID vertex2
1    Bob     Bob      1    Bill
2    Bob    Bill      1     Bob
3    Bob     Bob      1    Jane
4    Bob    Jane      1     Bob
5    Bob    Bill      1    Jane
6    Bob     Bob      2    Jane
7    Bob    Jane      2     Bob
8    Bob    Jane      1    Bill
9   Jane     Bob      1    Bill
10  Jane    Bill      1     Bob
11  Jane     Bob      1    Jane
12  Jane    Jane      1     Bob
13  Jane    Bill      1    Jane
14  Jane    Jane      1    Jill
15  Jane     Bob      2    Jane
16  Jane    Jane      2     Bob
17  Jane    Jane      1    Bill
18  Jane    Jane      3    Jill
19  Jane    Jill      3    Jane
20  Jane    Jane      3   Susan
21  Jane   Susan      3    Jane
22  Jane   Susan      3    Jill

问题是这对于大型边缘列表来说似乎效率低下。在我的真实数据中,“查找”有 3,263 个观察值,而“边缘”有 167,775,170 个观察值。我已经尝试在具有 16 个内核和 100GB 或 RAM 的 Amazon EC2 实例上 运行 上面的脚本两天了,现在看不到尽头(使用“future_lapply”而不是“lapply" 以允许并行处理)。有什么方法可以使它更 efficient/faster?

这不会是我唯一一次需要像这样对边缘进行分组,我希望找到一种方法来做到这一点,在时间和亚马逊账单方面都不会那么昂贵。

我想你可以先缩小原来的 data.frame edges,然后你就可以避免在每次迭代的 lapply 中使用 unique

下面的代码可能会加快一点,但不确定它如何获得您的真实数据。

edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)

更新

edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)

给予

> res
   group vertex1 vertex2 edgeID
1    Bob    Bill     Bob      1
2    Bob     Bob    Jane      1
3    Bob    Bill    Jane      1
4    Bob     Bob    Jane      2
5   Jane    Bill     Bob      1
6   Jane     Bob    Jane      1
7   Jane    Bill    Jane      1
8   Jane    Jane    Jill      1
9   Jane     Bob    Jane      2
10  Jane    Jane    Jill      3
11  Jane    Jane   Susan      3
12  Jane    Jill   Susan      3

当你输入plot(g)时,你会看到简化的如下