Why is vroom so slow?
I have a simple operation: I read several CSVs, bind them, and export the result, but vroom is much slower at this than the other approaches. I must be doing something wrong, but I'm not sure what, or why.
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
write_csv(mtcars, "test.csv")
microbenchmark(
readr={
t <- read_csv("test.csv", col_types=cols())
write_csv(t, "test.csv")
},data.table={
t <- fread("test.csv")
fwrite(t, "test.csv", sep=",")
},vroom={
t <- vroom("test.csv", delim=",", show_col_types = F)
vroom_write(t, "test.csv", delim=",")
},
times=10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#>      readr 12.636961 12.662955 15.865400 12.928211 13.503029  41.104583    10
#> data.table  2.200815  2.275252  2.633456  2.342797  2.529283   4.830134    10
#> vroom 57.376353 57.915135 64.280365 58.496847 58.966311 117.150837 10
Created on 2021-07-01 by the reprex package (v2.0.0)
To test with more data, I used a CSV from https://www.datosabiertos.gob.pe/dataset/vacunaci%C3%B3n-contra-covid-19-ministerio-de-salud-minsa, which contains 7.3+ million rows, and ran a slight variation of your code:
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
csv_file <- "vacunas_covid.csv.gz"
microbenchmark(
readr={
t <- read_csv(csv_file, col_types=cols())
write_csv(t, csv_file)
},data.table={
t <- fread(csv_file)
fwrite(t, csv_file, sep=",")
},vroom={
t <- vroom(csv_file, delim=",", show_col_types = F)
vroom_write(t, csv_file, delim=",")
},
times=5
)
The results are:
Unit: seconds
expr min lq mean median uq max neval cld
readr 101.72094 105.75384 109.16869 106.08111 108.06967 124.21788 5 c
data.table 28.18751 30.32570 31.06592 30.44838 33.12746 33.24055 5 a
vroom 48.65399 51.52445 55.78264 52.89823 53.83582 72.00071 5 b
From these results, vroom is at least 2x faster than readr on a large dataset, while data.table is ~1.7x faster than vroom. Perhaps the issue with the original example is that the data is so small that the indexing vroom performs accounts for the difference.
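One way to probe the lazy-indexing hypothesis is to compare vroom's default lazy reads against fully eager ones. This is a sketch, assuming a vroom version where the `altrep` argument controls lazy ALTREP columns (in older releases the argument was named `altrep_opts`); with `altrep = FALSE` all columns are parsed up front instead of being materialized on first access during `vroom_write()`:

```r
library(vroom)
library(microbenchmark)

csv_file <- "vacunas_covid.csv.gz"

microbenchmark(
  vroom_lazy = {
    # Default behavior: vroom indexes the file and parses columns lazily,
    # so part of the parsing cost is paid again inside vroom_write()
    t <- vroom(csv_file, delim = ",", show_col_types = FALSE)
    vroom_write(t, csv_file, delim = ",")
  },
  vroom_eager = {
    # altrep = FALSE forces full parsing during the read, so the write
    # only serializes already-materialized columns
    t <- vroom(csv_file, delim = ",", show_col_types = FALSE, altrep = FALSE)
    vroom_write(t, csv_file, delim = ",")
  },
  times = 5
)
```

If the two timings are close on the tiny `mtcars` file but diverge on the 7.3M-row file, that would support the idea that indexing overhead, not parsing speed, dominates the small-data case.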
Just in case, the code and results are at: https://gist.github.com/jmcastagnetto/fef3f3a2778028e7efb6836d6d8e3f8e