Why does reassigning a data frame to a new name in dplyr make it faster?

Question

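I was unhappy with how long dplyr and data.table took to create a new variable on my data.frame, so I decided to compare the approaches.

To my surprise, assigning the result of dplyr::mutate() to a new data.frame appears to be faster than not assigning it.

Why is that?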

library(data.table)
library(tidyverse)


dt <- fread(".... data.csv") #load 200MB datafile

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)

# data.table: add the column by reference
a <- Sys.time()
dt1[, MONTH := month(as.Date(DATE))]
b <- Sys.time(); datatabletook <- b - a

# dplyr: result assigned to a new name
c <- Sys.time()
dt_dplyr <- dt2 %>%
  mutate(MONTH = month(as.Date(DATE)))
d <- Sys.time(); dplyr_reassign_took <- d - c

# dplyr: result not assigned
e <- Sys.time()
dt3 %>%
  mutate(MONTH = month(as.Date(DATE)))
f <- Sys.time(); dplyrtook <- f - e

datatabletook        = 17sec
dplyrtook            = 47sec
dplyr_reassign_took  = 17sec

Answer 1

There are a couple of ways to benchmark with base R:

.t0 <- Sys.time()
    ...
.t1 <- Sys.time()
.t1 - .t0

# or

system.time({
    ...
})

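With the Sys.time approach, you send each line to the console separately, so you may see a return value printed for each line, as @Axeman suggested. With {...}, there is only one return value (the last result inside the braces), and system.time keeps it from being printed.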

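If printing is expensive enough, but is not the part you want to measure, it can make a difference. In the question, the third pipeline is the only one whose result is not assigned, so when run at the console its value would be auto-printed inside the timed interval.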
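To see how printing can end up inside a timing, here is a minimal sketch. The data frame `df` and the added column `z` are hypothetical stand-ins, not the question's 200MB file, and `capture.output()` is used only to keep the console clean while still doing the formatting work that printing requires.

```r
# Hypothetical stand-in data; not the data from the question.
df <- data.frame(x = seq_len(5000), y = rnorm(5000))

# The value of the braces is discarded by system.time(), so nothing is printed.
t_silent <- system.time({
  transform(df, z = x + y)
})

# An explicit print() puts the printing cost inside the timed region.
# capture.output() collects the printed lines instead of showing them.
t_printed <- system.time({
  invisible(capture.output(print(transform(df, z = x + y))))
})

t_silent[["elapsed"]]   # time for the computation alone
t_printed[["elapsed"]]  # time for the computation plus formatting the output
```

On most machines the second timing should come out larger, but the exact numbers depend on the data and the system, so treat this as an illustration, not a measurement of the question's case.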

There are good reasons to prefer system.time over Sys.time for benchmarking; from @MattDowle's comment:

i) it does a gc first excluded from the timing to isolate from random gc's and

ii) it includes user and sys time as well as elapsed wall clock time.

The Sys.time() way will be affected by reading your email in Chrome or using Excel while the test runs, the system.time() way won't so long as you use the user and sys parts of the result.
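As a rough sketch of the points in that comment, the result of system.time() can be split into its user, sys, and elapsed parts. The workload here (sorting random numbers) is an arbitrary placeholder, and `cpu_time` is just an illustrative name for the sum of the user and sys parts.

```r
# gcFirst = TRUE is the default: a garbage collection runs before timing starts.
timing <- system.time({
  x <- sort(runif(1e6))  # placeholder workload
}, gcFirst = TRUE)

timing[["user.self"]]  # CPU time spent in R code for this process
timing[["sys.self"]]   # CPU time spent in the system on behalf of this process
timing[["elapsed"]]    # wall-clock time, affected by whatever else is running

# CPU time only, which is less sensitive to other programs than elapsed time.
cpu_time <- timing[["user.self"]] + timing[["sys.self"]]
cpu_time
```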


performance

r

dplyr

data.table

tidyverse