Why is dplyr code that precomputes the maximum of a column slower than dplyr code that computes it inside the mutate call?

Question

Example data frame:

ngroups <- 100
nsamples <- 1000
foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))

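For each engine group I want to find the max of cycles and use mutate to create a new variable tte = max(cycles) - cycles. I assumed the code would be faster if I precomputed the maximum as its own column, rather than recomputing it for every row inside the mutate call. It turns out I was wrong: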
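As a cross-check on what tte should contain, here is a minimal base R sketch of the same per-group computation on a tiny made-up data frame. The names small and tte_base are illustrative only and do not appear in the original code; ave() is used instead of dplyr purely so the expected values are easy to verify by hand.

```r
# Tiny illustrative data set (hypothetical values, not the benchmark data)
small <- data.frame(
  engine = c(1, 1, 1, 2, 2),
  cycles = c(2, 5, 3, 10, 4)
)

# ave() applies FUN within each engine group and returns a vector
# aligned with the original row order
small$tte_base <- ave(small$cycles, small$engine,
                      FUN = function(x) max(x) - x)

print(small)
```

Within engine 1 the maximum is 5 and within engine 2 it is 10, so each row's tte_base is the distance from its own group's maximum.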

library(microbenchmark)
library(dplyr)
library(magrittr)

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles)) %>% 
    mutate(tte = max_cycles - cycles) %>% select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(tte = max(cycles) - cycles) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
# expr      min        lq     mean   median       uq       max neval
# add_tte(foo) 17.45324 21.107264 26.50535 24.52625 28.75208 113.98433   500
# add_tte_old(foo)  8.10376  9.949188 13.35830 12.18336 14.52474  77.64578   500

Why is this the case? What makes dplyr compute the maximum only once per group rather than once per row?

EDIT: Even if I use a single mutate statement in add_tte and build a larger example, add_tte_old is still faster:

# these are the only lines of code modified, the rest is as before
nsamples <- 10000

foo <- data.frame(engine = rep(seq(1, ngroups), each = nsamples), cycles = runif(ngroups*nsamples, 0, nsamples))

add_tte <- function(dataset){
  dataset %<>% group_by(engine) %>% mutate(max_cycles = max(cycles), tte = max_cycles - cycles) %>%
  select(-max_cycles) %>% ungroup
}

# the new results are:
microbenchmark(add_tte(foo), add_tte_old(foo), times = 500)
# Unit: milliseconds
# expr      min        lq      mean    median        uq      max neval
# add_tte(foo) 90.46658 107.14015 139.13570 131.83689 158.24358 411.3272   500
# add_tte_old(foo) 39.38357  46.13531  62.57386  52.00782  69.26815 176.1512   500

Answer 1

You are making some wrong assumptions, but more importantly, you are not comparing like with like.

It makes more sense to look at the following two variants, each of which creates and then drops one extra column:

add_tte <- function(dataset) {
  dataset %<>% group_by(engine) %>% mutate(max_cycles = rep(max(cycles), times = n()), tte = max_cycles - cycles) %>%
    select(-max_cycles) %>% ungroup
}

add_tte_old <- function(dataset) {
  dataset %<>% group_by(engine) %>% mutate(extra = rep(1, times = n()), tte = max(cycles) - cycles) %>%
    select(-extra) %>% ungroup
}

microbenchmark(add_tte(foo), add_tte_old(foo), times = 100)

On my machine, these two perform very similarly.

Ironically, the way you tried to precompute max(cycles) may have done exactly what you were trying to avoid :)

In this case you really should use an explicit rep() to fill the column, whereas automatic recycling is fine in the subtraction max(cycles) - cycles.
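The recycling point can be seen outside dplyr. The sketch below uses plain base R with made-up values for a single group, and the variable names are illustrative. It shows that subtracting from a length-one maximum gives the same result as first expanding the maximum to full length with rep(); the only difference is whether the full-length intermediate vector is built explicitly. It does not model dplyr's internals.

```r
cycles <- c(2, 5, 3)  # illustrative values for one group

# Length-one maximum, recycled automatically by the subtraction
tte_recycled <- max(cycles) - cycles

# Maximum expanded to a full-length vector first, then subtracted
max_cycles <- rep(max(cycles), times = length(cycles))
tte_explicit <- max_cycles - cycles

# Both routes give the same values
identical(tte_recycled, tte_explicit)
```

The second route allocates an extra vector as long as the group, which is the kind of additional work that storing max_cycles as a column implies.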

Tags: r, microbenchmark, dplyr