Is there a faster way than fread() to read big data?
Hello, first of all, I have already searched on Stack and Google and found posts such as Quickly reading very large tables as dataframes. While those are useful and well answered, I am looking for more information.

I am looking for the best way to read/import "big" data of up to 50-60 GB. I am currently using the fread() function from data.table, which is the fastest function I know of at the moment. The pc/server I work on has a good cpu (workstation) and 32 GB of RAM, but data over 10 GB, sometimes with billions of observations, still takes a long time to read.

We already have sql databases, but for certain reasons we have to work with the big data in R. Is there a way to speed up R, or an even better option than fread(), when it comes to huge files like this?

Thank you.

EDIT: fread("data.txt", verbose = TRUE)
omp_get_max_threads() = 2
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 2 threads (omp_get_max_threads()=2, nth=2)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file C://somefolder/data.txt
File opened, size = 1.083GB (1163081280 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<ID,Dat,No,MX,NOM_TX>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 100 lines of 5 fields using quote rule 0
Detected 5 columns on line 1. This line is either column names or first data row. Line starts as: <<ID,Dat,No,MX,NOM_TX>>
Quote rule picked = 0
fill=false and the most number of columns found is 5
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (1163081278 bytes from row 1 to eof) / (2 * 5778 jump0size) == 100647
Type codes (jump 000) : 5A5AA Quote rule 0
Type codes (jump 100) : 5A5AA Quote rule 0
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 10054 sample rows
=====
Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 1163081249
Line length: mean=56.72 sd=20.65 min=25 max=128
Estimated number of rows: 1163081249 / 56.72 = 20506811
Initial alloc = 41013622 rows (20506811 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5A5AA
[10] Allocate memory for the datatable
Allocating 5 column slots (5 - 0 dropped) with 41013622 rows
[11] Read the data
jumps=[0..1110), chunk_size=1047820, total_size=1163081249
|--------------------------------------------------|
|==================================================|
Read 20935277 rows x 5 columns from 1.083GB (1163081280 bytes) file in 00:31.484 wall clock time
[12] Finalizing the datatable
Type counts:
2 : int32 '5'
3 : string 'A'
=============================
0.007s ( 0%) Memory map 1.083GB file
0.739s ( 2%) sep=',' ncol=5 and header detection
0.001s ( 0%) Column type detection using 10054 sample rows
1.809s ( 6%) Allocation of 41013622 rows x 5 cols (1.222GB) of which 20935277 ( 51%) rows used
28.928s ( 92%) Reading 1110 chunks (0 swept) of 0.999MB (each chunk 18860 rows) using 2 threads
+ 26.253s ( 83%) Parse to row-major thread buffers (grown 0 times)
+ 2.639s ( 8%) Transpose
+ 0.035s ( 0%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
31.484s Total
You can use select = columns to load only the relevant columns without saturating your memory. For example:
dt <- fread("./file.csv", select = c("column1", "column2", "column3"))
I have used read.delim() to read files that fread() could not load completely. So you could convert your data to .txt and use read.delim().
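A rough sketch of that fallback, assuming the five columns shown in the verbose log above (int, character, int, character, character); the separator and column classes are inferred from that log, not stated in the answer:

# Base-R fallback: read the comma-separated .txt file with read.delim().
# Supplying colClasses up front skips type guessing and saves memory churn.
df <- read.delim("data.txt", sep = ",", header = TRUE,
                 stringsAsFactors = FALSE,
                 colClasses = c("integer", "character", "integer",
                                "character", "character"))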
But then, why not open a connection to the SQL server you want to pull the data from? You can use library(odbc) to open a connection to the SQL server and write your queries as usual; that way you can optimize memory usage. Check out this short introduction to odbc.
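A minimal sketch of that approach; the driver name, server, database, credentials, table name, and WHERE clause below are placeholders to replace with your own details (the column names come from the file header in the verbose log):

library(DBI)
library(odbc)

# Hypothetical connection details -- replace with your own server/database/credentials.
con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",
                 Server   = "my-server",
                 Database = "my_database",
                 UID      = "my_user",
                 PWD      = "my_password")

# Let the database do the filtering so only the rows/columns you need reach R.
dt <- dbGetQuery(con, "SELECT ID, Dat, No FROM my_table WHERE Dat >= '2020-01-01'")

dbDisconnect(con)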
Assuming you want the file fully read into R, using a database or selecting a subset of columns/rows will not help much.
What can help in such a case:
- make sure you are using the most recent version of data.table
- make sure the optimal number of threads is set: use setDTthreads(0L) to use all available threads; by default data.table uses 50% of the available threads (see the sketch after this list)
- check the output of fread(..., verbose=TRUE), and possibly add it to your question here
- put your file on a fast disk or a RAM disk and read it from there
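A short sketch of the threading advice above; setDTthreads(), getDTthreads(), and the verbose argument are standard data.table/fread usage, and the file path is the one from the question's log:

library(data.table)

# Use all available threads instead of the default 50%.
setDTthreads(0L)
getDTthreads(verbose = TRUE)   # confirm how many threads will actually be used

# Re-read with verbose output to see where the time is spent.
dt <- fread("C://somefolder/data.txt", verbose = TRUE)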
If your data has many distinct character variables, you may not be able to get great speed, because populating R's internal global character cache is single-threaded; the parsing itself can go fast, but creating the character vectors will be the bottleneck.