R - how to nest data frame over a group in parallel cores
Any idea how to run the following operation in parallel cores?
Libraries and example data
libs <- c("plyr", "dplyr", "tidyr")
sapply(libs, require, character.only = T)
set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE), value = runif(100000))
Operation to run in parallel cores:
df %>%
group_by(id) %>%
nest()
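Any help is much appreciated!
Simply nesting a data.frame using multiple cores is not efficient, so I assume you want to perform some other calculation as well. The example below computes summary, which returns multiple values for each group ID. The multidplyr package is handy for this sort of thing.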
# replace plyr with multidplyr
libs <- c("dplyr", "tidyr", "multidplyr")
devtools::install_github("hadley/multidplyr")
sapply(libs, require, character.only = TRUE)
set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE),
                 value = runif(100000)) %>% as.tbl
# first the single core solution. No need to nest,
# since group_by %>% do() automatically nests.
x <- df %>%
  group_by(id) %>%
  # nest() %>%
  do(stat_summary = summary(.$value) %>% as.matrix %>% t %>% data.frame %>% as.tbl) %>%
  ungroup
# next, multiple core solution
n_cores <- 2
cl <- multidplyr::create_cluster(n_cores)
# you have to load the packages into each cluster
cluster_library(cl, c("dplyr", "tidyr"))
df_mp <- df %>% multidplyr::partition(cluster = cl, id) # group by id
x_mp <- df_mp %>%
  do(stat_summary = summary(.$value) %>% as.matrix %>% t %>% data.frame %>% as.tbl) %>%
  collect() %>%
  ungroup
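The results match. Unless the calculation you are doing is slower than loading the data into each separate process, you probably won't see much of a speedup.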
all.equal(unnest(x_mp),unnest(x))
x_mp
TRUE
# A tibble: 10 x 2
id stat_summary
<int> <list>
1 3 <tibble [1 x 6]>
2 5 <tibble [1 x 6]>
3 6 <tibble [1 x 6]>
4 7 <tibble [1 x 6]>
5 1 <tibble [1 x 6]>
6 2 <tibble [1 x 6]>
7 4 <tibble [1 x 6]>
8 8 <tibble [1 x 6]>
9 9 <tibble [1 x 6]>
10 10 <tibble [1 x 6]>
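For comparison, here is a minimal sketch of the same per-group summary using base R's parallel package instead of multidplyr. This is a swapped-in technique, not the approach used above: the groups are split into a plain list and sent to worker processes with parLapply. The names groups, res_seq and res_par, and the choice of 2 workers, are assumptions made for illustration.

```r
library(parallel)

set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE), value = runif(100000))

# Split the value column by id: one numeric vector per group
groups <- split(df$value, df$id)

# Single-core reference: one summary per group
res_seq <- lapply(groups, summary)

# Multi-core version: 2 worker processes (assumed count, adjust as needed)
cl <- makeCluster(2)
res_par <- parLapply(cl, groups, summary)
stopCluster(cl)

# Compare the sequential and parallel results
all.equal(res_seq, res_par)
```

The same caveat applies here: each group has to be serialized and shipped to a worker, so for a cheap function like summary the transfer cost can easily outweigh any gain from the extra cores.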