R 4.0 performance: data frames vs lists, loops vs vectorization - an example with constant-vector subtraction

This is my third time reading Hadley Wickham's infamous Advanced R, and in the second chapter he explains why iterating over lists is better than iterating over data frames. Everything seems reasonable, familiar, and expected. The example function subtracts a vector of medians from each column of a data frame; it is basically a form of centering. To test things out, I ran the code provided by Grosser, Bumann & Wickham in Advanced R Solutions:

library(bench)

# Function to generate random dataframe
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# For loop function that runs on dataframe
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Same for loop function but that runs on list
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}


benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
  
  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    time_unit = "ms"
  )
}

# Results
results <- bench::press(
  ncol = c(1, 10, 50, 100, 250, 300, 400, 500, 750, 1000),
  benchmark_medians(ncol)
)
#> Running with:
#>     ncol
#>  1     1
#>  2    10
#>  3    50
#>  4   100
#>  5   250
#>  6   300
#>  7   400
#>  8   500
#>  9   750
#> 10  1000

library(ggplot2)

ggplot(
  results,
  aes(ncol, median, col = attr(expression, "description"))
) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns",
    y = "Execution Time (ms)",
    colour = "Data Structure"
  ) +
  theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2021-08-08 by the reprex package (v2.0.1)

Again, nothing unusual, and the authors explain:

When working directly with the data frame, the execution time grows quadratically with the number of columns in the input data. This is because (e.g.) the first column must be copied n times, the second column n-1 times, and so on. When working with a list, the execution time increases only linearly.

Obviously in the long run, linear growth creates shorter run-times, but there is some cost to this strategy — we have to convert between data structures with as.list() and list2DF(). Even though this is fast and probably doesn’t hurt much, the improved approach doesn’t really pay off in this scenario until we get to a data frame that is about 300 columns wide (with the exact value depending on the characteristics of the system running the code).
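The copying the authors describe can be made visible with base R's tracemem(), which prints a message every time the traced object is duplicated. This is a minimal sketch (the exact number of copies reported varies by R version and build):

```r
# Make the copying visible: tracemem() reports each duplication
# of the traced object.
df <- data.frame(a = runif(5), b = runif(5))

tracemem(df)
for (i in seq_along(df)) df[[i]] <- df[[i]] - 0.5
# one or more "tracemem[...]" lines appear: modifying a column
# through the data.frame method duplicates the object
untracemem(df)

lst <- as.list(data.frame(a = runif(5), b = runif(5)))
tracemem(lst)
for (i in seq_along(lst)) lst[[i]] <- lst[[i]] - 0.5
# typically little or no output: list elements are replaced
# without duplicating the whole structure
untracemem(lst)
```

The loop itself is identical in both cases; only the method dispatched by `[[<-` differs.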

But then I thought: I work with data frames all the time, not lists, so surely I can write a function that runs in reasonable time on a data frame with many columns. I was wrong! I now have more questions than when I started: aren't for loops supposed to be slower than vectorized operations? Did version 4 of R gain new optimizations I don't know about? Why is the sweep function so slow? Shouldn't matrix/vectorized operations be at least as fast as the loops? Does the garbage collector slow things down significantly? What am I doing wrong inside my functions (I surely am in some of them)?

library(bench)
  
# Function to generate random dataframe
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# For loop function that runs on dataframe
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Same for loop function but that runs on list
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}
  
# 5 Functions that I thought should be decently fast 
my_mapply <- function(x, medians) {
  as.data.frame(
    mapply(
      function(x, medians){x - medians},
      x,
      medians
    )  
  )
} 

my_sweep <- function(x, medians) {
  sweep(x, 2, medians)
}

my_apply <- function(x, medians) {
  x <- t(apply(x, 1, function(a) a - medians))
  as.data.frame(x)
}

my_vectorized <-function(x, medians) { 
  x <- as.matrix(x)
  x <- t(t(x) - medians)
  x <- as.data.frame(x)
  x
}  

my_vectorized2 <- function(x, medians) {
  ones <- rep(1, nrow(x))
  x_median <- ones %*% t(medians)
  x - x_median
}


benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
  
  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    "my_mapply" = my_mapply(df, medians), 
    "my_sweep" = my_sweep(df, medians),
    # "my_apply" = my_apply(df, medians),   # excluded because it is very very slow compared with the others
    "my_vectorized" = my_vectorized(df, medians),
    "my_vectorized2" = my_vectorized2(df, medians),
    time_unit = "ms"
  )
}

# Have a quick check on dataframe with only 3 columns
benchmark_medians(3)
#> # A tibble: 6 x 6
#>   expression        min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 data frame     0.0463 0.0884    12769.    4.44MB     55.9
#> 2 list           0.0640 0.0693    12944.  486.86KB    127. 
#> 3 my_mapply      0.152  0.160      5772.    1.05MB    124. 
#> 4 my_sweep       1.14   1.18        809.    2.21MB     38.7
#> 5 my_vectorized  0.208  0.212      4253.     1.1MB     89.2
#> 6 my_vectorized2 0.844  0.875      1074.    1.93MB     46.4

# Results
results <- bench::press(
  ncol = c(1, 10, 50, 100, 250, 300, 400, 500, 750, 1000),
  benchmark_medians(ncol)
)
#> Running with:
#>     ncol
#>  1     1
#>  2    10
#>  3    50
#>  4   100
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  5   250
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  6   300
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  7   400
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  8   500
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  9   750
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 10  1000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

library(ggplot2)

ggplot(
  results,
  aes(ncol, median, col = attr(expression, "description"))
) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns",
    y = "Execution Time (ms)",
    colour = "Data Structure"
  ) +
  theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2021-08-08 by the reprex package (v2.0.1)

These results have been remarkably consistent across all my tests, regardless of the characteristics of the generated data. I would appreciate any insight, as well as any advice on performance or on the use of a particular function or function family.

Thanks.

Aren't for loops supposed to be slower than vectorized operations? / Shouldn't matrix/vectorized operations be at least as fast as the loops?

It depends. However, I think there is a misconception about vectorized operations (and about data.frames). Consider my_vectorized2():

my_vectorized2 <- function(x, medians) {
  ones <- rep(1, nrow(x))
  x_median <- ones %*% t(medians)
  x - x_median
}

The "bottleneck" here is the fourth line. Contrary to what you assumed (and I assumed too), x - x_median does not trigger a vectorized computation: x is not a matrix but a data.frame (i.e., a list). Arithmetic on data.frames only works because convenience methods have been implemented for them. That this should not be taken for granted can be seen in the examples below.

# there is a `+` method for data.frames
as.data.frame(1:5) + as.data.frame(6:10)

1   7
2   9
3  11
4  13
5  15

# but no `+` method for lists!
as.list(1:5) + as.list(6:10)

Error in as.list(1:5) + as.list(6:10) : 
  non-numeric argument to binary operator
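Because `-` on a data.frame dispatches to Ops.data.frame() rather than to the internal vector arithmetic, the same subtraction is measurably slower on a data.frame than on a matrix. A minimal sketch (absolute timings depend on your machine):

```r
m   <- matrix(runif(1e4 * 100), nrow = 1e4)
df  <- as.data.frame(m)
sub <- matrix(0.5, nrow = nrow(m), ncol = ncol(m))

system.time(m - sub)                  # internal vectorized arithmetic
system.time(df - as.data.frame(sub))  # dispatches to Ops.data.frame()
```

Both expressions compute exactly the same numbers; only the dispatch path differs.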

These convenience methods for data.frames are, in short, less efficient than vector-based computation. By profiling the code, we can take a closer look at where the time is spent:

df <- create_random_df(nrow = 1e4, ncol = 1000)
medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
profvis::profvis(my_vectorized2(df, medians))

By far the most time is spent in Ops.data.frame(), apparently in the `-` method splitting the frame into its columns.

why is the sweep function so slow? / does the garbage collector slow things down significantly?

I am not intimately familiar with the internals of sweep(), but its poor performance is mainly caused by the issue described above. In addition, several references to the data are created, which leads to copies; profiling confirms that garbage collection also takes a considerable share of the time. The same applies to your approaches with mapply() and apply(). my_vectorized() is comparatively fast because the computation really is vectorized. However, the conversions from data.frame to matrix and back, and the transposition with t(), are costly (each causes a copy) and should be avoided.
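If you want to stay with data.frames, one sketch keeps the per-column arithmetic vectorized while avoiding as.matrix() and the two t() transpositions (subtract_map is a hypothetical name, not from the question):

```r
subtract_map <- function(x, medians) {
  # Map() walks the columns and medians in parallel; each subtraction
  # is a plain vectorized operation on a single column. `x[] <-`
  # writes the results back while preserving the data.frame attributes.
  x[] <- Map(`-`, x, medians)
  x
}
```

This is essentially subtract_df() expressed functionally: it skips the matrix round-trip but still pays for the data.frame replacement method on assignment.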

To sum up, one piece of advice: when you are after performance in arithmetic operations, generally work with matrices rather than data.frames.
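Following that advice, the centering can be done entirely in matrix form, converting once at each end (center_via_matrix is a hypothetical helper; the final as.data.frame() can be dropped if a matrix result is acceptable):

```r
center_via_matrix <- function(x, medians) {
  m <- as.matrix(x)
  # build the subtrahend row-wise so that column j loses medians[j];
  # the subtraction itself is a single vectorized matrix operation
  out <- m - matrix(medians, nrow(m), ncol(m), byrow = TRUE)
  as.data.frame(out)
}
```

The byrow = TRUE fill replaces the two t() calls of my_vectorized() with one allocation and no transposition.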