Performance of data.table

Question

I have always assumed that data.table offers the best data-access performance.

However, when I benchmarked the following two statements, I got the results below.

app_sig_reg[which(app_sig_reg$input == proj$country),]$value
app_sig_reg[input == proj$country,value]

Here app_sig_reg is a data.table object.

This is what I got when I measured their performance with the microbenchmark library.

microbenchmark(
  app_sig_reg[which(app_sig_reg$input == proj$country),]$value,
  app_sig_reg[input == proj$country,value]
)

Unit: microseconds
                                                          expr   min     lq     mean  median      uq    max neval
 app_sig_reg[which(app_sig_reg$input == proj$country), ]$value 118.5 132.05  165.932  146.55  163.70  489.1   100
                     app_sig_reg[input == proj$country, value] 967.3 993.85 1098.607 1028.05 1123.35 1752.6   100

My assumption was that app_sig_reg[input == proj$country,value] would run faster, but the result is exactly the opposite.

Any insight would be much appreciated.

Answer 1

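The question is not entirely clear about what is being matched against what. If there is only one country, the results below suggest that the speed depends on:

which vs. equal (wrapping the comparison in which() or passing the logical vector directly);
the extraction method for class "data.table": $ vs. [.

If, instead of an equality test against a single element (country), the test is against several elements with %in%, the results might be different.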
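
As a rough sketch of the multi-country case, the snippet below repeats the four subsetting styles with %in% instead of ==. The proj_multi table and its three letters are made up for illustration and are not part of the original question. Timings depend on the machine and on the table size, so no expected output is shown.

```r
library(data.table)
library(microbenchmark)

set.seed(2022)
# Toy data shaped like the example in this answer: 100 rows, one-letter "countries".
app_sig_reg <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
# Hypothetical lookup table with several countries (an assumption for this sketch).
proj_multi <- data.table(country = c("a", "b", "c"))

# The four styles: which() or a bare logical vector, combined with $ or [ extraction.
r_which_dollar  <- app_sig_reg[which(app_sig_reg$input %in% proj_multi$country), ]$value
r_which_bracket <- app_sig_reg[which(input %in% proj_multi$country), value]
r_in_dollar     <- app_sig_reg[input %in% proj_multi$country, ]$value
r_in_bracket    <- app_sig_reg[input %in% proj_multi$country, value]

# Sanity check before timing: every style should select the same values.
stopifnot(
  identical(r_which_dollar, r_which_bracket),
  identical(r_which_dollar, r_in_dollar),
  identical(r_which_dollar, r_in_bracket)
)

mb <- microbenchmark(
  `which$` = app_sig_reg[which(app_sig_reg$input %in% proj_multi$country), ]$value,
  `which[` = app_sig_reg[which(input %in% proj_multi$country), value],
  `in$`    = app_sig_reg[input %in% proj_multi$country, ]$value,
  `in[`    = app_sig_reg[input %in% proj_multi$country, value],
  times = 50
)
print(mb)
```

Checking that all four expressions agree before timing them helps ensure the benchmark compares equivalent work rather than different results.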

library(data.table)
library(microbenchmark)
library(ggplot2)

set.seed(2022)
app_sig_reg <- data.table(
  input = sample(letters, 100, TRUE),
  value = runif(100)
)
proj <- data.table(country = sample(letters, 1))


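# Benchmark the four subsetting styles on tables of growing size.
# Note: the \(k) lambda shorthand requires R >= 4.1.
testFun <- function(X, n){
  out <- lapply(seq.int(n), \(k){
    Y <- X
    # Double the table k times, so Y has nrow(X) * 2^k rows.
    for(i in seq.int(k)) Y <- rbind(Y, Y)
    mb <- microbenchmark(
      `which$` = Y[which(Y$input == proj$country), ]$value,
      `which[` = Y[which(input == proj$country), value],
      `equal$` = Y[input == proj$country,]$value,
      `equal[` = Y[input == proj$country,value]
    )
    # Keep only the median time per expression, tagged with the table size.
    agg <- aggregate(time ~ expr, mb, median)
    agg$nrow <- nrow(Y)
    agg
  })
  do.call(rbind, out)
}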

res <- testFun(app_sig_reg, 15)

ggplot(res, aes(nrow, time, color = expr)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(`which$` = "red", `equal$` = "orangered", `which[` = "blue", `equal[` = "skyblue")) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme_bw()

Created on 2022-02-20 by the reprex package (v2.0.1)
