How can I speed up a function combining rbind and lapply?
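I have a large data frame (100,000 rows, 19 columns). For each month, I need to count the number of cases that contain each possible combination of 5 items.
The following code works on a small dataset, but it takes far too long on my full dataset. From my searching, I suspect that preallocating the data frame is the key, but I don't know how to do that.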
library(dplyr)
Case<-c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9)
Month<- c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec","Dec","Dec","Apr","Dec","Dec","Dec","Dec","Dec","Dec")
Fruits<-c("Apple","Orange","Grape","Grape","Orange","Apple","Apple","Orange","Grape","Apple","Orange","Grape","Grape","Apple","Orange","Grape","Apple","Orange","Grape")
df<-data.frame(Case,Month,Fruits)
Patterns <- with(df, do.call(rbind, lapply(unique(Case), function(x){
  y <- subset(df, Case == x )
  Date <- as.character(y$Month[1])
  Fruits <- paste(unique(y$Fruits[order(y$Fruits)]), collapse = ' / ')
  as.data.frame(unique(cbind(Case = y$Case, Date, Fruits)))
})))
Total <- Patterns %>%
  group_by(Date, Fruits) %>%
  tally()
The result I get is acceptable, but the process takes too long, and because the dataset is large I run out of memory.
We can do everything in a single dplyr pipeline. First we group_by Case and Month and paste all the Fruits in each group together; then we group by Month and Fruits and count the rows in each group with tally.
library(dplyr)
df %>%
  group_by(Case, Month) %>%
  summarise(Fruits = paste(Fruits, collapse = "/")) %>%
  group_by(Month, Fruits) %>%
  tally()
# or use count() instead of tally()
# Month Fruits n
# <fct> <chr> <int>
#1 Apr Grape 1
#2 Dec Apple/Orange/Grape 3
#3 Jan Apple/Orange/Grape 1
#4 Mar Grape/Orange 1
#5 Nov Orange/Grape 1
#6 Sep Apple 2
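One thing to watch: the pipeline above pastes Fruits in row order, so "Grape/Orange" and "Orange/Grape" are treated as different patterns, whereas the code in the question sorts and de-duplicates the fruits first. If the order within a case should not matter, a minimal sketch along the following lines may be closer to the original intent. It assumes dplyr 1.0.0 or later (for the `.groups` argument), and the name `patterns` is only illustrative.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  Case   = c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9),
  Month  = c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec",
             "Dec","Dec","Apr","Dec","Dec","Dec","Dec","Dec","Dec"),
  Fruits = c("Apple","Orange","Grape","Grape","Orange","Apple","Apple",
             "Orange","Grape","Apple","Orange","Grape","Grape","Apple",
             "Orange","Grape","Apple","Orange","Grape")
)

patterns <- df %>%
  group_by(Case, Month) %>%
  # sort + unique so the pattern label does not depend on row order
  summarise(
    Fruits = paste(sort(unique(as.character(Fruits))), collapse = " / "),
    .groups = "drop"   # drop grouping so count() below starts ungrouped
  ) %>%
  count(Month, Fruits)  # one row per (Month, pattern) with its frequency n

print(patterns)
```

Because the grouping is dropped explicitly, `count(Month, Fruits)` can replace the `group_by()` plus `tally()` pair.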
On large datasets, data.table will usually be considerably faster than dplyr:
library(data.table)
setDT(df)[, lapply(.SD, toString), by = c("Case","Month")][,.N, by = c("Fruits","Month")]
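Two details of the one-liner above may matter on the full data. `setDT(df)` converts `df` to a data.table by reference, and `lapply(.SD, toString)` collapses every non-grouping column, which on a 19-column data frame means 17 columns rather than just Fruits. The following is a hedged sketch that works on a copy, touches only Fruits, and sorts the fruits so the pattern does not depend on row order. The names `dt`, `patterns` and `counts` are illustrative, not from the original answer.

```r
library(data.table)

# Same example data as in the question
df <- data.frame(
  Case   = c(1,1,1,2,2,3,4,5,5,6,6,6,7,8,8,8,9,9,9),
  Month  = c("Jan","Jan","Jan","Mar","Mar","Sep","Sep","Nov","Nov","Dec",
             "Dec","Dec","Apr","Dec","Dec","Dec","Dec","Dec","Dec"),
  Fruits = c("Apple","Orange","Grape","Grape","Orange","Apple","Apple",
             "Orange","Grape","Apple","Orange","Grape","Grape","Apple",
             "Orange","Grape","Apple","Orange","Grape")
)

# as.data.table() makes a copy, so df itself is left unchanged
dt <- as.data.table(df)

# Step 1: one row per (Case, Month) with a sorted, de-duplicated pattern label
patterns <- dt[
  , .(Fruits = paste(sort(unique(as.character(Fruits))), collapse = " / ")),
  by = .(Case, Month)
]

# Step 2: number of cases per (Month, pattern)
counts <- patterns[, .N, by = .(Month, Fruits)]

print(counts)
```

Whether this is actually faster than the dplyr version on a given dataset is worth measuring directly, for example with `system.time()`.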