R: How to quickly read large .dta files without RAM Limitations

Question

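I have a 10 GB Stata .dta file that I am trying to read into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD), and the .dta file has about 3 million rows and somewhere between 400 and 800 variables.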

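I know data.table (via fread()) is the fastest way to read .txt and .csv files, but does anyone have a recommendation for reading large .dta files into R? Reading the file into Stata as a .dta file takes about 20-30 seconds, although I need to set my working memory maximum before opening the file (I set the maximum to 100 GB).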

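I have not tried converting the file to .csv in Stata, and I would like to avoid touching the file with Stata at all. I found a solution in "Using memisc to import stata .dta file into R", but it assumes that RAM is scarce. In my case, I should have enough RAM to work with the file.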

Answer 1

I recommend the haven R package. Unlike foreign, it can read the latest Stata formats:

library(haven)
data <- read_dta('myfile.dta')

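I am not sure how fast it is compared with the other options, but your choices for reading Stata files in R are fairly limited. My understanding is that haven wraps a C library (ReadStat), so it is probably your fastest option.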
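
If only part of the file is needed, reading fewer rows or columns is one way to cut load time. The following is a minimal sketch, assuming a recent haven version in which read_dta() accepts the n_max and col_select arguments. The toy data frame and the temporary file are hypothetical and exist only to make the example self-contained.

```r
library(haven)

# Hypothetical toy data, written to a temporary .dta file so the example runs on its own
toy <- data.frame(id = 1:5,
                  x  = c(1.5, 2.5, 3.5, 4.5, 5.5),
                  y  = c(10, 20, 30, 40, 50))
path <- tempfile(fileext = ".dta")
write_dta(toy, path)

# Read only the first 3 rows and two of the three columns
subset_dta <- read_dta(path, n_max = 3, col_select = c(id, x))
print(subset_dta)
```

With a real file, replace path with the location of the .dta file and list the variables that are actually needed.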

Answer 2

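The fastest way to load a large Stata dataset in R is to use the readstata13 package. I compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 is the fastest available package for reading Stata datasets in R.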
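
For reference, basic readstata13 usage might look like the sketch below. It assumes a readstata13 version that supports the select.rows and select.cols arguments of read.dta13(). The toy data frame and temporary file are hypothetical stand-ins for a real dataset.

```r
library(readstata13)

# Hypothetical toy data, saved to a temporary .dta file so the example runs on its own
toy <- data.frame(id = 1:5,
                  x  = c(1.5, 2.5, 3.5, 4.5, 5.5),
                  y  = c(10, 20, 30, 40, 50))
path <- tempfile(fileext = ".dta")
save.dta13(toy, file = path)

# A single value for select.rows reads rows 1 through that value
first_rows <- read.dta13(path, select.rows = 3)

# select.cols restricts the read to the named variables
some_cols <- read.dta13(path, select.cols = c("id", "x"))
```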

Answer 3

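Since this post is at the top of the search results, I re-ran the benchmark with the current versions of haven and readstata13. The two packages appear to be on par at this point, with haven slightly ahead. In terms of time complexity, both are approximately linear in the number of rows.

Here is the code used to run the benchmark: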

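library(haven)        # read_dta()
library(readstata13)  # read.dta13()
library(dplyr)        # tibble(), bind_rows(), %>%

# Row counts to test: 10^2 to 10^7 in half-decade steps, rounded to whole rows
sizes <- round(10^(seq(2, 7, .5)))

benchmark_read <- function(n_rows) {
  # Time haven reading the first n_rows rows
  start_t_haven <- Sys.time()
  maisanta_dataset <- read_dta("my_large_file.dta", n_max = n_rows)
  end_t_haven <- Sys.time()

  # Time readstata13 reading the same number of rows
  start_t_readstata13 <- Sys.time()
  maisanta_dataset <- read.dta13("my_large_file.dta", select.rows = n_rows)
  end_t_readstata13 <- Sys.time()

  # Store elapsed times in seconds so the units are the same in every row
  tibble(size = n_rows,
         haven_time = as.numeric(difftime(end_t_haven, start_t_haven, units = "secs")),
         readstata13_time = as.numeric(difftime(end_t_readstata13, start_t_readstata13, units = "secs")))
}

# Run the benchmark for every size and stack the results into one table
benchmark_results <-
  lapply(sizes, benchmark_read) %>%
  bind_rows()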
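
One way to check the "approximately linear" claim is to fit a line on the log-log scale: if time grows as a * size^b, the slope of log(time) against log(size) estimates b, and a value near 1 suggests linear scaling. The sketch below uses base R only. The helper name scaling_exponent is hypothetical, and the timings are synthetic, exactly linear values chosen to show usage. They are not measurements.

```r
# Estimate the exponent b in time ~ a * size^b with a log-log regression
scaling_exponent <- function(size, time) {
  fit <- lm(log(time) ~ log(size))
  unname(coef(fit)[2])  # slope of the fitted line
}

# Synthetic, exactly linear timings (NOT benchmark results), for illustration only
size <- 10^(2:7)
time <- 2e-6 * size

b <- scaling_exponent(size, time)
print(b)
```

Applied to the benchmark table, the call could be scaling_exponent(benchmark_results$size, as.numeric(benchmark_results$haven_time)), assuming all times are expressed in the same unit.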

memory

r

large-files

stata