Tidygraph: calculate child summaries at parent level
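Using the tidygraph package in R, given a tree, I would like to calculate the mean, sum, variance, ... of the values of each node's direct children.
My intuition is to use map_bfs_back_dbl or a related function, so I tried modifying the help example, but I got stuck: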
library(tidygraph)
# Collect values from children
create_tree(40, children = 3, directed = TRUE) %>%
  mutate(value = round(runif(40)*100)) %>%
  mutate(child_acc = map_bfs_back_dbl(node_is_root(), .f = function(node, path, ...) {
    if (nrow(path) == 0) .N()$value[node]
    else {
      sum(unlist(path$result[path$parent == node]))
    }
  }))
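For the above, I would like the mean of value over all direct (first-level) children of each parent in the tree.
Update: I tried this approach (calculating the variance of the children's attribute):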
library(tidygraph)
create_tree(40, children = 3, directed = TRUE) %>%
  mutate(parent = bfs_parent(),
         value = round(runif(40)*100)) %>%
  group_by(parent) %>%
  mutate(var = var(value))
This is very close:
# Node Data: 40 x 3 (active)
# Groups: parent [14]
parent value var
* <int> <dbl> <dbl>
1 NA 2.00 NA
2 1 13.0 1393
3 1 63.0 1393
4 1 86.0 1393
5 2 27.0 890
6 2 76.0 890
# ... with 34 more rows
What I would like to see is:
# Node Data: 40 x 3 (active)
# Groups: parent [14]
parent value var child_var
* <int> <dbl> <dbl> <dbl>
1 NA 2.00 NA 1393
2 1 13.0 1393 890
3 1 63.0 1393 (etc)
4 1 86.0 1393
5 2 27.0 890
6 2 76.0 890
# ... with 34 more rows
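That is, move the (first) "var" value of each group to the node identified by that group's "parent" value. Help? Suggestions?
Edit: this is what I ended up doing: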
# Build the tree first, so the join below can refer to it
tree <- create_tree(40, children = 3, directed = TRUE) %>%
  mutate(parent = bfs_parent(),
         value = round(runif(40) * 100),
         name = row_number())
# Compute one variance per parent, then attach it to the parent node
tree <- tree %>%
  activate(nodes) %>%
  left_join(
    tree %>%
      group_by(parent) %>%
      mutate(var = var(value)) %>% activate(nodes) %>% as_tibble() %>%
      group_by(parent) %>% summarize(child_stat = first(var)),
    by = c("name" = "parent")
  )
It doesn't feel very tidy for a graph, but it seems to work. Open to optimizations.
I attempted a more "tidygraph" way of doing this. The main function calculates the variance of the value column:
calc_child_stats <- function(neighborhood, ...){
  ## By default the neighborhood includes the parent and all of its children
  ## First remove the parent, then run analysis
  neighborhood %>% activate(nodes) %>%
    slice(-1) %>%
    select(value) %>%
    pull %>%
    var
}
Once you have that function, it is a simple call to map_local rather than the map_bfs you were attempting:
tree <- create_tree(40, children = 3, directed = TRUE) %>%
mutate(value = round(runif(40)*100))
tree %>% mutate(var = map_local_dbl(order = 1, mode="out", .f = calc_child_stats))
#> # A tbl_graph: 40 nodes and 39 edges
#> #
#> # A rooted tree
#> #
#> # Node Data: 40 x 2 (active)
#> value var
#> <dbl> <dbl>
#> 1 29 34.3
#> 2 45 433
#> 3 56 225.
#> 4 47 868
#> 5 78 604.
#> 6 43 283
#> # ... with 34 more rows
#> #
#> # Edge Data: 39 x 2
#> from to
#> <int> <int>
#> 1 1 2
#> 2 1 3
#> 3 1 4
#> # ... with 36 more rows
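The same pattern should extend to the other statistics the question asks about (mean, sum). Below is a minimal sketch, assuming, as calc_child_stats does, that the parent is the first node of each neighborhood. make_child_stat is a hypothetical helper name, not part of tidygraph, and the seed is added only to make the sketch reproducible.

```r
library(tidygraph)
library(dplyr)

# Hypothetical helper (not part of tidygraph): wraps a summary function so it
# can be passed to map_local_dbl() as .f.
make_child_stat <- function(fun) {
  function(neighborhood, ...) {
    values <- neighborhood %>%
      activate(nodes) %>%
      pull(value)
    # Assumption carried over from calc_child_stats: the first node is the parent.
    children <- values[-1]
    # Leaves have no children; return NA rather than summarizing an empty vector.
    if (length(children) == 0) NA_real_ else fun(children)
  }
}

set.seed(42)  # only for reproducibility of this sketch
tree <- create_tree(40, children = 3, directed = TRUE) %>%
  mutate(value = round(runif(40) * 100))

stats <- tree %>%
  mutate(child_mean = map_local_dbl(order = 1, mode = "out", .f = make_child_stat(mean)),
         child_sum  = map_local_dbl(order = 1, mode = "out", .f = make_child_stat(sum)),
         child_var  = map_local_dbl(order = 1, mode = "out", .f = make_child_stat(var)))
```

Each statistic triggers its own pass over the neighborhoods, so this is likely to be slower still than the single-statistic call.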
Although my tidygraph version is more "graphy", it does not seem to be very fast, so I ran a quick microbenchmark of the two approaches:
library(microbenchmark)
microbenchmark(tree %>% mutate(var = map_local_dbl(order = 1, mode="out", .f = calc_child_stats)))
#> Unit: milliseconds
#> expr
#> tree %>% mutate(var = map_local_dbl(order = 1, mode = "out", .f = calc_child_stats))
#> min lq mean median uq max neval
#> 115.3325 123.0303 127.7889 126.6683 130.057 191.6065 100
microbenchmark(calc_child_stats_dplyr(tree))
#> Unit: milliseconds
#> expr min lq mean median uq
#> calc_child_stats_dplyr(tree) 4.915917 5.213939 6.292579 5.573978 6.717745
#> max neval
#> 16.72846 100
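Created on 2018-06-15 by the reprex package (v0.2.0).
Sure enough, the dplyr way is much faster, so I would stick with it for now. The two approaches gave the same values in my tests.
For completeness, here is the function I used to replicate the OP's approach: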
calc_child_stats_dplyr <- function(tree){
  tree <- tree %>%
    mutate(parent = bfs_parent(),
           name = row_number())
  tree %>% activate(nodes) %>%
    left_join(
      tree %>%
        group_by(parent) %>%
        mutate(var = var(value)) %>%
        activate(nodes) %>%
        as_tibble() %>%
        group_by(parent) %>%
        summarize(child_stat = first(var)),
      by=c("name" = "parent")
    )
}
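A possible variation on the dplyr approach skips bfs_parent() and works from the edge list instead: in a directed tree every edge runs from a parent to a child, so grouping the edges by from yields one summary row per parent. This is a sketch only and has not been benchmarked against the versions above. The column names id, child_value, child_mean, child_sum and child_var are illustrative, and the seed is added only for reproducibility.

```r
library(tidygraph)
library(dplyr)

set.seed(42)  # only for reproducibility of this sketch
tree <- create_tree(40, children = 3, directed = TRUE) %>%
  mutate(value = round(runif(40) * 100),
         id = row_number())

# Plain tibble of node data, used to look up each child's value by index
node_values <- tree %>%
  activate(nodes) %>%
  as_tibble()

# One row per parent: summarize the values of the nodes its edges point to
child_stats <- tree %>%
  activate(edges) %>%
  as_tibble() %>%
  mutate(child_value = node_values$value[to]) %>%
  group_by(from) %>%
  summarize(child_mean = mean(child_value),
            child_sum  = sum(child_value),
            child_var  = var(child_value))

# Attach the summaries to the parent nodes; leaves get NA
result <- tree %>%
  activate(nodes) %>%
  left_join(child_stats, by = c("id" = "from"))
```

Because all statistics come out of a single summarize() call, adding another one only means adding another column expression.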