Efficient way to compare vertex lookup table with edge list and assign vertex label to any edge that matches
I have two data frames. The first is a lookup table containing a list of vertex names:
lookup <- data.frame(Name=c("Bob","Jane"))
The second is an edge list that looks like this:
edges <- data.frame(vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob"
,"Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob"
,"Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"))
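For each unique vertex in the "lookup" table, I want to go through the "edges" table and label every edgeID where lookup$Name is among the vertices.
I can do that with the following script: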
library(igraph)
g <- graph_from_data_frame(edges[c(1, 3, 2)], directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      as.character(lookup$Name),
      function(nm) {
        z <- c(nm, V(g)$name[distances(g, nm) == 1])
        cbind(group = nm, unique(subset(edges, vertex1 %in% z & vertex2 %in% z)))
      }
    )
  )
)
group vertex1 edgeID vertex2
1 Bob Bob 1 Bill
2 Bob Bill 1 Bob
3 Bob Bob 1 Jane
4 Bob Jane 1 Bob
5 Bob Bill 1 Jane
6 Bob Bob 2 Jane
7 Bob Jane 2 Bob
8 Bob Jane 1 Bill
9 Jane Bob 1 Bill
10 Jane Bill 1 Bob
11 Jane Bob 1 Jane
12 Jane Jane 1 Bob
13 Jane Bill 1 Jane
14 Jane Jane 1 Jill
15 Jane Bob 2 Jane
16 Jane Jane 2 Bob
17 Jane Jane 1 Bill
18 Jane Jane 3 Jill
19 Jane Jill 3 Jane
20 Jane Jane 3 Susan
21 Jane Susan 3 Jane
22 Jane Susan 3 Jill
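The problem is that this seems inefficient for large edge lists. In my real data, "lookup" has 3,263 observations and "edges" has 167,775,170 observations. I have been running the script above on an Amazon EC2 instance with 16 cores and 100 GB of RAM for two days now, with no end in sight (using "future_lapply" instead of "lapply" to allow parallel processing). Is there any way to make this more efficient or faster?
This will not be the only time I need to group edges like this, so I am hoping to find a way to do it that is less costly in both time and Amazon bills.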
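I think you can shrink the original data.frame edges first, so that you avoid calling unique in every iteration of the lapply.
The code below may speed things up a little, but I am not sure how it will perform on your real data.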
edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
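One possible further tweak, offered as a sketch that has not been timed on data of the real size: distances(g, nm) computes shortest-path lengths from nm to every vertex, although only the vertices at distance 0 or 1 are kept. igraph's neighbors() returns the adjacent vertices directly. The result name res_nb is only an illustrative choice, and the toy data is repeated so the snippet runs on its own.

```r
library(igraph)

# Toy data copied from the question
lookup <- data.frame(Name = c("Bob", "Jane"), stringsAsFactors = FALSE)
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

# De-duplicate once, outside the loop
edges.unique <- unique(edges[c(1, 3, 2)])
g <- graph_from_data_frame(edges.unique, directed = FALSE)

res_nb <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        # nm itself plus every vertex that shares an edge with it;
        # unique() drops repeats caused by parallel edges
        z <- unique(c(nm, neighbors(g, nm)$name))
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
```

On the toy data this selects the same vertex set z as the distances() version. Whether it helps noticeably on 167 million edges is something I have not measured, since the subset() scan over the whole edge list is still done once per name.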
Update
edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)
res <- do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      lookup$Name,
      function(nm) {
        z <- colnames(d <- distances(g, nm))[which(d < 2)]
        cbind(group = nm, subset(edges.unique, vertex1 %in% z & vertex2 %in% z))
      }
    )
  )
)
which gives
> res
group vertex1 vertex2 edgeID
1 Bob Bill Bob 1
2 Bob Bob Jane 1
3 Bob Bill Jane 1
4 Bob Bob Jane 2
5 Jane Bill Bob 1
6 Jane Bob Jane 1
7 Jane Bill Jane 1
8 Jane Jane Jill 1
9 Jane Bob Jane 2
10 Jane Jane Jill 3
11 Jane Jane Susan 3
12 Jane Jill Susan 3
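The update collapses each undirected edge to a single row by putting the alphabetically smaller name in vertex1. As a small sketch of the same idea, base R's pmin() and pmax() also accept character vectors and do the swap in one vectorised call each. This assumes the vertex columns are character rather than factor (hence stringsAsFactors = FALSE); the name edges.sorted is only illustrative.

```r
# Toy data copied from the question
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

# Order the endpoints of every edge, then drop repeated (pair, edgeID) rows
edges.sorted <- unique(data.frame(
  vertex1 = pmin(edges$vertex1, edges$vertex2),  # alphabetically first endpoint
  vertex2 = pmax(edges$vertex1, edges$vertex2),  # alphabetically last endpoint
  edgeID  = edges$edgeID,
  stringsAsFactors = FALSE
))

nrow(edges.sorted)
```

The toy edge list has 21 rows; after ordering the endpoints and de-duplicating, 8 distinct (pair, edgeID) rows remain, the same 8 that appear in the Jane group above.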
If you run plot(g), you will see the simplified graph.
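Another variation, again only a sketch under assumptions and not benchmarked: the loop above still scans the whole edge table once per lookup name. igraph's make_ego_graph() with order = 1 returns the subgraph induced by a vertex and its direct neighbours, and as_data_frame() turns that subgraph's edges back into a data frame, so the edge table itself is never filtered. The name res_ego is illustrative, the columns come back as from/to instead of vertex1/vertex2, and the order of the two endpoints within a row may differ from the table above.

```r
library(igraph)

# Toy data copied from the question
lookup <- data.frame(Name = c("Bob", "Jane"), stringsAsFactors = FALSE)
edges <- data.frame(
  vertex1 = c("Bob","Bill","Bob","Jane","Bill","Jane","Bob","Jane","Bob","Bill","Bob",
              "Jane","Bill","Jane","Bob","Jane","Jane","Jill","Jane","Susan","Susan"),
  edgeID = c(1,1,1,1,1,1,2,2,1,1,1,1,1,1,2,2,3,3,3,3,3),
  vertex2 = c("Bill","Bob","Jane","Bob","Jane","Jill","Jane","Bob","Bill","Bob",
              "Jane","Bob","Jane","Bill","Jane","Bob","Jill","Jane","Susan","Jane","Jill"),
  stringsAsFactors = FALSE
)

# Same de-duplication as in the update: one row per undirected (pair, edgeID)
edges.unique <- unique(
  transform(
    edges[c("vertex1", "vertex2", "edgeID")],
    vertex1 = ifelse(vertex1 < vertex2, vertex1, vertex2),
    vertex2 = ifelse(vertex1 < vertex2, vertex2, vertex1)
  )
)
g <- graph_from_data_frame(edges.unique, directed = FALSE)

res_ego <- do.call(rbind, lapply(lookup$Name, function(nm) {
  # subgraph induced by nm and its direct neighbours
  eg <- make_ego_graph(g, order = 1, nodes = nm)[[1]]
  # edges of that subgraph, with the edgeID attribute kept as a column
  e <- igraph::as_data_frame(eg, what = "edges")
  data.frame(group = rep(nm, nrow(e)), e, stringsAsFactors = FALSE)
}))
```

On the toy data this yields 4 edges for Bob and 8 for Jane, matching the group sizes in the result above. With thousands of lookup names the groups can overlap heavily, so memory use of the combined result is worth checking on a sample first.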