How can I speed up R code by replacing complex, slow plyr steps with data.table or dplyr?

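I have been learning R by searching Stack Overflow for how other people do things, and as a result I have become familiar with plyr syntax. The four ddply calls below are the rate-limiting steps of my code. My data ranges from hundreds of thousands to millions of records; thanks to data.table, most of my code runs well, and it is held back only by these four slow but critical plyr steps. I would like to replace them with dplyr or data.table, but I have been struggling to replicate the syntax and would appreciate any help.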

1. mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(cat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))

2. webuniq_test <- ddply(webuniq, c("df.activ.id"),summarise, strRM = paste(replicate(RMCount, "RM"), collapse = ","))

3. webactiv2 <- ddply(webactiv, .(VisitorID), summarize, VisitorPath = paste(Path, collapse = ","), RMpath = paste(strRM, collapse = ","), ConvTot=sum(Conv), Conv2Tot=sum(Conv2), Cov3Tot=sum(Conv3)) #check that nrow dec

4. MeltForSO3 <- ddply(MeltForSO2, c("VisitorID","ID"),summarise, SplitThis = paste(value, collapse = ","))

For (1), this is the benchmark:

#user  system elapsed 
#378.463   3.136 383.786

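Here is what I am trying to accomplish in these steps (they are similar):

  1. They aggregate the data by one or more ID fields.
  2. Granular character fields are aggregated with paste and collapse. For example, one field might be a driver's stops as he drops off packages, with "stops" field values 'a', 'b', 'c' for each stop. In plyr, stops_path = paste(stops, collapse = ",") rolls these rows up into a single row such as "a,b,c".
  3. Numeric data is sometimes summarised in the same aggregation step, e.g. ConvTot=sum(Conv).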
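To make the goal concrete, here is a minimal base R sketch of the three points above. The data frame stops_df and its driver column are made up for illustration; only stops, stops_path, Conv and ConvTot echo names used in this question.

```r
# Hypothetical toy data: one row per stop, several stops per driver
stops_df <- data.frame(
  driver = c("d1", "d1", "d1", "d2", "d2"),
  stops  = c("a", "b", "c", "x", "y"),
  Conv   = c(1, 0, 2, 5, 1),
  stringsAsFactors = FALSE
)

# Collapse the character field and sum the numeric field, per driver
stops_path <- tapply(stops_df$stops, stops_df$driver, paste, collapse = ",")
ConvTot    <- tapply(stops_df$Conv,  stops_df$driver, sum)

# One row per driver: the collapsed path plus the numeric total
result <- data.frame(
  driver     = names(stops_path),
  stops_path = as.vector(stops_path),
  ConvTot    = as.vector(ConvTot),
  stringsAsFactors = FALSE
)
print(result)
```

This is the shape of output I want from each of the four steps, only on far more rows.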

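I have tried to replicate this with dplyr and with data.table, without success.

For these kinds of aggregations, is there an advantage to using one over the other? I looked at the question below, and it seems data.table might be better for my very simple use case because the syntax is cleaner: data.table vs dplyr: can one do something well the other can't or does poorly?

Here is my failed attempt to replicate (1) above with data.table: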

setkey(setDT(mergeddf2),df.activ.id, MarketingChannel)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ",")), by=list(df.activ.id,Channel)] 

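This raised an error: unused argument (by = list(df.activ.id, Channel)). I adapted this from code I found on SO while researching how to use paste inside data.table. I removed the by argument just to see what would happen, and got a different error from the line below: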

mergeddf3test <- mergeddf2[, list(spotsids = paste(spotID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))] 

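The error is "Error in paste(spotID, collapse = ",") : object 'spotID' not found", which is strange because that field is definitely in the data. I expected this data.table line to aggregate the data by the by fields (df.activ.id and Channel) and to combine the character fields as in the (a,b,c) example above.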

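Clearly, given the scale of data I am working with, I need to learn dplyr or data.table syntax properly, so I have signed up for the DataCamp classes on both packages. Still, I would appreciate any help with solving this in the short term.

Thanks!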

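Your data.table replication works for me (except that Channel is capitalised). Below is my attempt to replicate the first step of your list with dplyr and with data.table.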
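If the errors persist on your side, two things may be worth checking: whether the object is really a data.table at the point of the call, and whether every column you reference exists with exactly that spelling and case. The sketch below uses a small made-up data frame df as a stand-in for mergeddf2; one common cause of "unused argument (by = ...)" is that `[` is dispatched on a plain data.frame, though I cannot tell whether that is what happened in your session.

```r
library(data.table)

# Hypothetical stand-in for mergeddf2, created as a plain data.frame
df <- data.frame(df.activ.id = c(1, 1, 2),
                 channel     = c("tv", "tv", "web"),
                 mainID      = 1:3)

# Check 1: is the object a data.table yet?
was_dt <- is.data.table(df)

# On a plain data.frame, `[` has no `by` argument, so the call fails;
# capture the error message instead of stopping
msg <- tryCatch(df[, list(n = length(mainID)), by = list(channel)],
                error = function(e) conditionMessage(e))

# Check 2: which of the referenced columns are absent from the data?
# Column names are case-sensitive, so "Channel" does not match "channel"
wanted <- c("df.activ.id", "Channel", "mainID", "spotID")
missing_cols <- setdiff(wanted, names(df))

# After converting by reference, the grouped call is accepted
setDT(df)
counts <- df[, list(n = length(mainID)), by = list(channel)]
```

Here missing_cols flags both the capitalised Channel and the spotID column, which are the two names that differ between your plyr call and your data.table attempts.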

# required packages
# (plyr is loaded before dplyr so that dplyr's summarise() is not masked)
library(plyr)
library(dplyr)
library(data.table)

Sample data

mergeddf2 <- data.frame(df.activ.id = 1:5, 
                        channel = 1:8, 
                        mainID = 1:40, 
                        DateTime = Sys.Date() - 80:1, 
                        cat = letters[1:6], 
                        effrespflag = rnorm(240), 
                        othervar = 1, 
                        MarketingChannel = 2)

plyr solution

mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, 
                   spotsids = paste(mainID, collapse = ","), 
                   spotsdt = paste(DateTime, collapse = ","), 
                   spotsinfos = paste(cat, collapse = ","), 
                   effrespflags = paste(effrespflag, collapse = ","))

dplyr solution

mergeddf3.dplyr <- 
  mergeddf2 %>% 
  group_by(df.activ.id, channel) %>%
  summarise_each(funs = funs(paste(., collapse = ",")), mainID, DateTime, cat, effrespflag) %>%
  magrittr::set_colnames(c("df.activ.id", "channel", "spotsids", "spotsdt", "spotsinfos", "effrespflags")) 
# check for equality
all.equal(mergeddf3, as.data.frame(mergeddf3.dplyr))
## [1] TRUE
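Note that summarise_each() and funs() have been deprecated in more recent dplyr releases. A version with a plain summarise() and explicitly named outputs avoids both, and also removes the need to rename columns afterwards. The sketch below assumes dplyr 1.0.0 or later for the .groups argument, and uses a small made-up data frame toy rather than the sample data above.

```r
library(dplyr)

# Hypothetical toy data with the column names used in step 1
toy <- data.frame(
  df.activ.id = c(1, 1, 2),
  channel     = c("tv", "tv", "web"),
  mainID      = c(10, 11, 12),
  DateTime    = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03")),
  cat         = c("a", "b", "c"),
  effrespflag = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# One output row per (df.activ.id, channel) group;
# each column is collapsed into a comma-separated string
toy_summary <- toy %>%
  group_by(df.activ.id, channel) %>%
  summarise(spotsids     = paste(mainID, collapse = ","),
            spotsdt      = paste(DateTime, collapse = ","),
            spotsinfos   = paste(cat, collapse = ","),
            effrespflags = paste(effrespflag, collapse = ","),
            .groups = "drop")
```

On older dplyr versions, dropping the .groups argument should give the same columns, with the result still grouped by df.activ.id.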

data.table solution

setDT(mergeddf2)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), 
                                  spotsdt = paste(DateTime, collapse = ","), 
                                  spotsinfos = paste(cat, collapse = ","), 
                                  effrespflags = paste(effrespflag, collapse = ",")),
                           by=list(df.activ.id,channel)] 
# check for equality
all.equal(mergeddf3, setDF(setkeyv(mergeddf3test, c("df.activ.id", "channel"))))
## [1] TRUE
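The other steps in your list should follow the same pattern. Here is a sketch for steps 3 and 4, using the column names from your ddply calls. The toy values are made up, and I have not run this on your data, so treat it as a starting point rather than a verified drop-in replacement.

```r
library(data.table)

# Step 3: hypothetical toy data with the columns referenced in the question
webactiv <- data.table(
  VisitorID = c(1, 1, 2, 2, 2),
  Path      = c("a", "b", "c", "d", "e"),
  strRM     = c("RM", "RM,RM", "RM", "RM", "RM"),
  Conv      = c(1, 0, 0, 1, 1),
  Conv2     = c(0, 0, 1, 0, 0),
  Conv3     = c(0, 1, 0, 0, 1)
)

# Collapse the character columns and sum the numeric ones in a single pass
webactiv2 <- webactiv[, .(VisitorPath = paste(Path, collapse = ","),
                          RMpath      = paste(strRM, collapse = ","),
                          ConvTot     = sum(Conv),
                          Conv2Tot    = sum(Conv2),
                          Cov3Tot     = sum(Conv3)),  # name kept as in the question
                      by = VisitorID]

# Step 4: hypothetical toy data, grouped by two ID columns
MeltForSO2 <- data.table(
  VisitorID = c(1, 1, 1, 2),
  ID        = c("x", "x", "y", "x"),
  value     = c("p", "q", "r", "s")
)

MeltForSO3 <- MeltForSO2[, .(SplitThis = paste(value, collapse = ",")),
                         by = .(VisitorID, ID)]
```

With by, groups come back in order of first appearance rather than sorted, which is why the equality check above calls setkeyv() before comparing against the ddply result.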