spec_tbl_df is over 10 times slower on the same operations as a normal tibble

Question

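I was wondering why two different R sessions working with the same data would take wildly different times to complete the same task. After restarting R many times, clearing all variables, and running a truly clean session, I found the issue: the new data structure returned by vroom and readr is, for some reason, super sluggish in my script. Of course, the easiest way around this is to convert the data to a tibble right after loading it. Or is there some other explanation, such as poor coding practice in my functions, that could account for the sluggish behaviour? Or is this a bug in recent updates of these packages? If so, and if someone is more experienced in reporting bugs to the tidyverse, here is a reprex showing the behaviour, because I feel this is out of my depth.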

#Load packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(vroom)
library(tidyr)
library(microbenchmark)
#Generate some dummy data
ex_data <- tibble(
  sd = 1,
  mean = 1:1000,
  a1 = rnorm(1000, mean, sd),
  a2 = rnorm(1000, mean, sd),
  a3 = rnorm(1000, mean, sd)
  ) %>% 
  mutate(
    a1 = if_else(a1<mean, NA_real_, a1),
    a2 = if_else(a2<mean, NA_real_, a2),
    a3 = if_else(a3<mean, NA_real_, a3)
  )
#Wrapper function exhibiting the behaviour
impute_row <- function(mean, sd, data){
  if(!anyNA(data)){
    return(data)
  }else{
    data <- as.data.frame(data)
    data[is.na(data)] <-  rnorm(n = sum(is.na(data)), mean = mean, sd = sd)
    return(data)
  }
}
#Main function
imputer <- function(data){
  data %>% 
    mutate(
      data = pmap(list(mean, sd, data), impute_row)
    ) %>% 
    unnest(cols = data)
}
#Generate dummy file
out_file <- tempfile(fileext = ".csv")
vroom_write(ex_data, out_file, ",")
#Read it in
ex_data_spc <- vroom(out_file, col_types = cols()) %>% 
  nest(data = -c(mean, sd))
#Nest the original data as well
ex_data <- ex_data %>% 
  nest(data = -c(mean, sd))
#Benchmark
microbenchmark(
  tib = imputer(ex_data),
  spc_tib = imputer(ex_data_spc),
  times = 10
)
#> Unit: milliseconds
#>     expr        min         lq       mean     median        uq       max neval
#>      tib   82.81192   87.45288   89.19118   90.47263   91.2216   93.4418    10
#>  spc_tib 1041.90378 1070.00579 1244.97090 1076.92022 1093.0054 2780.0722    10

Created on 2021-06-14 by the reprex package (v2.0.0)

In the worst case, a run is almost 30 times slower.

Answer 1

This is the issue I had in mind. These problems are known to happen with vroom itself rather than with the spec_tbl_df class, which does not really do much.

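vroom does all sorts of things to try to speed up reading; AFAIK mostly lazy reading. That is how you end up with all those differing components when comparing the two datasets.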
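As a rough illustration of the lazy reading mentioned above: `vroom()` has an `altrep` argument that controls whether columns are returned as lazily materialized ALTREP vectors. The sketch below reads a small, hypothetical file both ways and checks that the values agree. Whether `altrep = FALSE` actually removes the slowdown in the question's benchmark has not been verified here, so treat it as something to try rather than a confirmed fix.

```r
library(vroom)

# Hypothetical small file, written only for this illustration
tmp <- tempfile(fileext = ".csv")
vroom_write(
  data.frame(mean = 1:5, sd = 1, a1 = c(1.2, NA, 3.1, NA, 5.4)),
  tmp,
  delim = ","
)

# Default behaviour: columns may be lazy ALTREP vectors
lazy <- vroom(tmp, col_types = cols(), altrep = TRUE)

# Assumption: turning ALTREP off makes vroom materialize columns up front
eager <- vroom(tmp, col_types = cols(), altrep = FALSE)

# The values themselves should be the same either way
all.equal(lazy$a1, eager$a1)
```

If the eager version behaves like the plain tibble in the benchmark, that would point at lazy reading rather than the class.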

With vroom:

~~~(snip)~~~
ex_data_spc <- vroom(out_file, col_types = cols()) %>% 
  nest(data = -c(mean, sd))
~~~(snip)~~~

#> Unit: milliseconds
#>     expr       min        lq     mean    median        uq       max neval cld
#>  spc_tib 1679.2088 1704.3085 2106.864 1731.6694 1942.9444 4918.4498    10   b
#>      tib  149.8716  158.8548  169.489  170.3735  182.5681  192.8533    10  a

all.equal(ex_data, ex_data_spc)
#>    [1] "Component \"data\": Component 1: Attributes: < Names: 1 string mismatch >"                                                 
#>    [2] "Component \"data\": Component 1: Attributes: < Length mismatch: comparison on first 2 components >"                        
#>    [3] "Component \"data\": Component 1: Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"   
#>    [4] "Component \"data\": Component 1: Attributes: < Component \"class\": 3 string mismatches >"                                 
#>    [5] "Component \"data\": Component 1: Attributes: < Component 2: Modes: numeric, externalptr >"  
                               
~~~(snip)~~~

With readr:

~~~(snip)~~~
ex_data_spc <- readr::read_csv(out_file, col_types = cols()) %>% 
  nest(data = -c(mean, sd))
~~~(snip)~~~
#> Unit: milliseconds
#>     expr      min       lq     mean   median       uq      max neval cld
#>  spc_tib 148.9432 161.7315 181.2137 184.4592 191.9048 219.7883    10   a
#>      tib 161.9441 166.7826 175.3644 175.3354 181.4598 197.5544    10   a

all.equal(ex_data, ex_data_spc)
#> [1] TRUE

If you like, you can post your reprex to that issue.
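To see how little the spec_tbl_df class itself carries, as noted at the start of this answer, here is a minimal sketch using readr on a small, hypothetical file. It relies on readr's documented behaviour that subsetting a spec_tbl_df, including an empty `[]`, drops the subclass and the `spec` attribute. It only illustrates the class; it is not a claim about the timings above.

```r
library(readr)

# Hypothetical small file, written only for this illustration
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(x = 1:3, y = c(0.5, 1.5, 2.5)), tmp)

df <- read_csv(tmp, col_types = cols())
class(df)              # includes "spec_tbl_df" on top of the tibble classes
names(attributes(df))  # includes the column specification, "spec"

# An empty subset returns an ordinary tibble without the subclass
plain <- df[]
class(plain)
```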
