Is there a faster way than fread() to read big data?

Hello, first of all I have already searched Stack Overflow and Google and found posts such as this one: Quickly reading very large tables as dataframes. While those are useful and well answered, I am looking for more information.

I am looking for the best way to read/import "big" data of up to 50-60GB. I am currently using the fread() function from data.table, which is the fastest function I know of at the moment. The pc/server I work on has a good CPU (workstation) and 32GB of RAM, but data over 10GB, sometimes nearing billions of observations, still takes a long time to read.

We already have SQL databases, but for some reasons we have to work with big data in R. Is there a way to speed R up, or an even better option than fread(), when it comes to huge files like this?

Thank you.

EDIT: fread("data.txt", verbose = TRUE)

omp_get_max_threads() = 2
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=2, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file C://somefolder/data.txt
  File opened, size = 1.083GB (1163081280 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<ID,Dat,No,MX,NOM_TX>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 5 fields using quote rule 0
  Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<ID,Dat,No,MX,NOM_TX>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (1163081278 bytes from row 1 to eof) / (2 * 5778 jump0size) == 100647
  Type codes (jump 000)    : 5A5AA  Quote rule 0
  Type codes (jump 100)    : 5A5AA  Quote rule 0
  'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10054 sample rows
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 1163081249
  Line length: mean=56.72 sd=20.65 min=25 max=128
  Estimated number of rows: 1163081249 / 56.72 = 20506811
  Initial alloc = 41013622 rows (20506811 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5A5AA
[10] Allocate memory for the datatable
  Allocating 5 column slots (5 - 0 dropped) with 41013622 rows
[11] Read the data
  jumps=[0..1110), chunk_size=1047820, total_size=1163081249
|--------------------------------------------------|
|==================================================|
Read 20935277 rows x 5 columns from 1.083GB (1163081280 bytes) file in 00:31.484 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : int32     '5'
         3 : string    'A'
=============================
   0.007s (  0%) Memory map 1.083GB file
   0.739s (  2%) sep=',' ncol=5 and header detection
   0.001s (  0%) Column type detection using 10054 sample rows
   1.809s (  6%) Allocation of 41013622 rows x 5 cols (1.222GB) of which 20935277 ( 51%) rows used
  28.928s ( 92%) Reading 1110 chunks (0 swept) of 0.999MB (each chunk 18860 rows) using 2 threads
   +   26.253s ( 83%) Parse to row-major thread buffers (grown 0 times)
   +    2.639s (  8%) Transpose
   +    0.035s (  0%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
  31.484s        Total

You can use select = columns to load only the relevant columns, without saturating your memory. For example:

dt <- fread("./file.csv", select = c("column1", "column2", "column3"))

I have used read.delim() to read files that fread() could not load completely. So you could convert your data to .txt and use read.delim().
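A minimal sketch of that fallback, assuming a comma-separated file renamed to .txt (the file name here is hypothetical):

# read.delim() defaults to tab-separated; override sep for comma-separated data
df <- read.delim("data.txt", sep = ",", header = TRUE, stringsAsFactors = FALSE)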

But why not open a connection to the SQL server you want to extract the data from? You can open a connection to a SQL server with library(odbc) and write your queries as usual. You can optimize memory usage that way.

Have a look at this short introduction to odbc.
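A minimal sketch of that approach, assuming a SQL Server instance; the driver, server, database, and table names below are placeholders, and the queried columns are taken from the verbose log above:

library(DBI)
library(odbc)

# open the connection (all connection details here are hypothetical)
con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",
                 Server   = "myserver",
                 Database = "mydatabase",
                 Trusted_Connection = "Yes")

# pull only the columns/rows you actually need, not the whole table
dt <- dbGetQuery(con, "SELECT ID, Dat, NOM_TX FROM mytable WHERE No > 0")

dbDisconnect(con)

This keeps the filtering on the database side, so R only ever holds the subset you asked for.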

Assuming you want your file fully read into R, using a database or selecting a subset of columns/rows won't help much.

What can help in such a case is:
- ensure you are using a recent version of data.table
- ensure the optimal number of threads is set: use setDTthreads(0L) to use all available threads; by default data.table uses 50% of the available threads (see the sketch after this list)
- check the output of fread(..., verbose=TRUE), and possibly add it to your question
- put your file on a fast disk or a RAM disk, and read it from there
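A minimal sketch of the thread setup, reusing the file path from the verbose log above:

library(data.table)

# use all available threads instead of the default 50%
setDTthreads(0L)
getDTthreads()   # confirm how many threads will actually be used

dt <- fread("C://somefolder/data.txt", verbose = TRUE)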

If your data has a lot of distinct character variables, you may not be able to get great speed, because filling R's internal global character cache is single-threaded; parsing can therefore go fast, but creating the character vectors will be the bottleneck.
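One rough way to gauge how much the character columns cost is to time a full read against a read that drops them via fread()'s drop argument; the string column names below (Dat, MX, NOM_TX) come from the verbose log above:

library(data.table)

system.time(dt_all <- fread("C://somefolder/data.txt"))
system.time(dt_num <- fread("C://somefolder/data.txt",
                            drop = c("Dat", "MX", "NOM_TX")))

If the second read is much faster, the single-threaded character-cache bottleneck described above is likely what you are seeing.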