How to improve speed of R code by replacing complex and slow plyr steps with data.table or dplyr?
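I've been learning R by searching for how others do things on Stack Overflow, and as a result I've grown familiar with the plyr syntax. I have the following four plyr calls with ddply that are the rate-limiting steps of my code. My data runs from hundreds of thousands to millions of records. Thanks to data.table, most of my code performs well; it is held back only by these four slow but crucial plyr steps. I'd like to replace them with dplyr or data.table, but I've been struggling to replicate the syntax and would appreciate any help.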
1. mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(cat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))
2. webuniq_test <- ddply(webuniq, c("df.activ.id"),summarise, strRM = paste(replicate(RMCount, "RM"), collapse = ","))
3. webactiv2 <- ddply(webactiv, .(VisitorID), summarize, VisitorPath = paste(Path, collapse = ","), RMpath = paste(strRM, collapse = ","), ConvTot=sum(Conv), Conv2Tot=sum(Conv2), Cov3Tot=sum(Conv3)) #check that nrow dec
4. MeltForSO3 <- ddply(MeltForSO2, c("VisitorID","ID"),summarise, SplitThis = paste(value, collapse = ","))
For (1), this is the benchmark:
#user system elapsed
#378.463 3.136 383.786
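Here is what I'm trying to accomplish in these steps (they are all similar):
- They aggregate data by one or more ID fields.
- Granular character fields are aggregated with paste and collapse. For example, a field might hold a driver's stops as he drops off packages, with a "stops" value of 'a', 'b', 'c' for each stop. In plyr,
stops_path = paste(stops, collapse = ",")
aggregates those rows into a single row such as "a,b,c".
- Numeric data are sometimes summed in the same aggregation step, e.g.
ConvTot=sum(Conv)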
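For concreteness, here is a toy base R illustration of the result these steps are after. The data frame stops_df and its driver column are made up for this sketch; only stops, stops_path, Conv and ConvTot echo names used above.

```r
# Hypothetical toy data: two drivers, one row per stop
stops_df <- data.frame(
  driver = c(1, 1, 1, 2, 2),
  stops  = c("a", "b", "c", "d", "e"),
  Conv   = c(1, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Collapse the character field into one string per driver
stops_path <- tapply(stops_df$stops, stops_df$driver, paste, collapse = ",")

# Sum the numeric field per driver in the same spirit
ConvTot <- tapply(stops_df$Conv, stops_df$driver, sum)

print(stops_path)
print(ConvTot)
```

Each driver ends up with a single comma-separated path and a single total, which is the shape the four ddply calls produce.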
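I tried replicating this with dplyr or data.table, without success.
Is there an advantage to using one over the other for these types of aggregations? I took a look at the following question, and it seems data.table may be the better fit for my very simple use case because its syntax is cleaner: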
data.table vs dplyr: can one do something well the other can't or does poorly?
Here is my failed attempt at replicating (1) above with data.table:
setkey(setDT(mergeddf2),df.activ.id, MarketingChannel)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ",")), by=list(df.activ.id,Channel)]
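This throws an error: unused argument (by = list(df.activ.id, Channel))
I wrote this starting from code I found on SO while researching how to use paste inside data.table. I took out the by argument just to see what would happen and got a different error on the following line: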
mergeddf3test <- mergeddf2[, list(spotsids = paste(spotID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))]
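The error is "Error in paste(spotID, collapse = ",") : object 'spotID' not found", which is odd because that field is definitely in the data. I thought this data.table line would aggregate the data by the by fields (df.activ.id and Channel) and combine the character fields as in the (a,b,c) example above.
Clearly, given the scale of data I'm working with, I need to learn the syntax of dplyr or data.table properly, so I've signed up for the DataCamp classes on both packages. Still, I'd appreciate any help with solving this in the short term.
Thanks!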
Your data.table replication works for me (except that channel was capitalized). Below is my attempt at replicating the first step of your list with dplyr and data.table.
# required packages
require(plyr)
require(dplyr)
require(data.table)
Sample data
mergeddf2 <- data.frame(df.activ.id = 1:5,
channel = 1:8,
mainID = 1:40,
DateTime = Sys.Date() - 80:1,
cat = letters[1:6],
effrespflag = rnorm(240),
othervar = 1,
MarketingChannel = 2)
plyr solution
mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize,
spotsids = paste(mainID, collapse = ","),
spotsdt = paste(DateTime, collapse = ","),
spotsinfos = paste(cat, collapse = ","),
effrespflags = paste(effrespflag, collapse = ","))
dplyr solution
mergeddf3.dplyr <-
mergeddf2 %>%
group_by(df.activ.id, channel) %>%
summarise_each(funs = funs(paste(., collapse = ",")), mainID, DateTime, cat, effrespflag) %>%
magrittr::set_colnames(c("df.activ.id", "channel", "spotsids", "spotsdt", "spotsinfos", "effrespflags"))
# check for equality
all.equal(mergeddf3, as.data.frame(mergeddf3.dplyr))
## [1] TRUE
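As a side note, summarise_each() and funs() were deprecated in later dplyr releases. A minimal sketch of the same idea with plain summarise(), assuming dplyr 1.0.0 or newer (for the .groups argument) and a made-up toy data frame called toy, could look like this:

```r
library(dplyr)

# Hypothetical toy data reusing a few column names from the question
toy <- data.frame(
  df.activ.id = c(1, 1, 2, 2, 2),
  channel     = c(1, 1, 1, 2, 2),
  mainID      = 1:5,
  cat         = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Name the output columns directly, so no renaming step is needed afterwards
toy_summary <- toy %>%
  group_by(df.activ.id, channel) %>%
  summarise(
    spotsids   = paste(mainID, collapse = ","),
    spotsinfos = paste(cat, collapse = ","),
    .groups    = "drop"
  )

print(toy_summary)
```

Writing out each summary column is more verbose than applying one function to several columns, but it keeps the target names next to the expressions that produce them.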
data.table solution
setDT(mergeddf2)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","),
spotsdt = paste(DateTime, collapse = ","),
spotsinfos = paste(cat, collapse = ","),
effrespflags = paste(effrespflag, collapse = ",")),
by=list(df.activ.id,channel)]
# check for equality
all.equal(mergeddf3, setDF(setkeyv(mergeddf3test, c("df.activ.id", "channel"))))
## [1] TRUE
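The other steps in your list should follow the same pattern. As an untested-on-your-data sketch, step (3) mixes paste and sum in one grouped call; the toy webactiv table below is invented for illustration, and only the column names come from the question:

```r
library(data.table)

# Hypothetical toy stand-in for webactiv
webactiv <- data.table(
  VisitorID = c("v1", "v1", "v2"),
  Path      = c("home", "cart", "home"),
  strRM     = c("RM", "RM,RM", "RM"),
  Conv      = c(0, 1, 0),
  Conv2     = c(1, 1, 0),
  Conv3     = c(0, 0, 0)
)

# Character columns are collapsed and numeric columns summed in the same j expression
webactiv2 <- webactiv[, .(VisitorPath = paste(Path, collapse = ","),
                          RMpath      = paste(strRM, collapse = ","),
                          ConvTot     = sum(Conv),
                          Conv2Tot    = sum(Conv2),
                          Cov3Tot     = sum(Conv3)),
                      by = VisitorID]

print(webactiv2)
```

Step (4) would be the same call with a single pasted column and by = .(VisitorID, ID).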