Fast and efficient character DataFrame creation in Rcpp
I wrote a parser that reads character values into std::vector<std::string> vectors; it parses 1 million records in under a second. I then want to convert the vectors into an Rcpp::DataFrame, which takes more than 80 seconds...
Is there a way to "efficiently" create an Rcpp::DataFrame from large character vectors?
With numeric values I would try to std::memcpy() the std::vector into an Rcpp::NumericVector (see this int64 example or this data.table example for more information), but that does not seem to work for character vectors because the element sizes differ.
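For contrast, a minimal sketch of the numeric case referenced above (copy_doubles is a hypothetical name, not taken from the linked examples): a std::vector<double> keeps its payload in one contiguous block, so the bytes can be copied straight into the R vector, whereas a character vector (STRSXP) holds pointers to individually allocated CHARSXP objects and has no contiguous payload to copy.
#include <Rcpp.h>
#include <cstring>
#include <vector>
//[[Rcpp::export]]
Rcpp::NumericVector copy_doubles(int n) {
  std::vector<double> v(n, 1.5);   // stand-in for parsed numeric data
  Rcpp::NumericVector out(n);      // allocates a REALSXP of length n
  // both payloads are contiguous doubles, so a single memcpy moves everything
  std::memcpy(REAL(out), v.data(), n * sizeof(double));
  return out;
}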
More information
The basic idea of the function is to parse sudoku string data (each sudoku string is exactly 81 characters long), with two sudokus per row (the data is stored as a .csv file; you can find the data here).
$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945
Inside the cpp read function I fread() the file, fill a buffer (buffer), and parse the data into the std::vector<std::string> vectors mentioned above (a and b in this example).
Note that the full code, including the experiments I have done so far, can be found in this gist.
const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case
FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..
int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
int i = 1;
while (i < this_buffer_size) {
// buffer from i to i + 81 would look like this:
// 004300209005009001070060043006002087190007400050083000600000105003508690042910300
// whereas for b it looks from i to i + 81 like this:
// 864371259325849761971265843436192587198657432257483916689734125713528694542916378
a[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
b[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
line++;
}
// check next buffer, not shown here...
}
// NEXT: parse the data to an R structure
This works on the 1-million-row dataset in under 250 ms.
Then I want to create an Rcpp::DataFrame from the two vectors a and b, and that is where the problem lies. The conversion to the R object takes around 80 seconds.
Is there a faster alternative that takes advantage of what we know about the data (2 items per row, 81 characters each, 1 million rows, ...)?
I do not necessarily need to fill the std::vectors first; if possible I could also collect the data directly into an Rcpp structure.
What I have tried so far
Textbook solution
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
List first
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.frame");
CharacterMatrix
Not really comparable to the other approaches, but it feels more native...
// before the main loop
std::vector<std::string> vec;
// vec holds both data entries, the first (unsolved) at 0 -> n_lines and solved values at n_lines -> n_lines * 2
vec.resize(2 * n_lines);
// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;
// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());
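A hypothetical continuation (not part of the gist) to get from the matrix back to a two-column data.frame: column proxies such as res(Rcpp::_, 0) are standard Rcpp and can feed DataFrame::create directly.
// hypothetical follow-up: convert the matrix to a data.frame without factors
Rcpp::CharacterVector unsolved = res(Rcpp::_, 0);
Rcpp::CharacterVector solved   = res(Rcpp::_, 1);
Rcpp::DataFrame df = Rcpp::DataFrame::create(
  Rcpp::Named("unsolved") = unsolved,
  Rcpp::Named("solved")   = solved,
  Rcpp::Named("stringsAsFactors") = false
);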
Full Code and Timings on Github
This is not a proper answer to my question, more a collection of ideas I did not want to waste plus some benchmarks. It may be useful for someone facing a similar problem.
Recall that the basic idea is to read 1 million rows of two 81-character-long strings into an R object (ideally a data.frame, data.table, or tibble).
For the benchmarks I used the 1-million sudoku dataset by Kyubyong Park.
I split the answer into two parts: 1) using other R packages and 2) working at a lower level with Rcpp/C++ and C.
Surprisingly, for character data, dedicated packages such as stringi, stringfish, or vroom are very efficient and beat (my) lower-level C++/C code.
An important thing to note is that some packages use ALTREP (see for example Francois' take on them here), which means the data is only materialized in R when it is needed. That is, loading the data with vroom takes less than 1 second, but the first operation (which needs to materialize the data) takes much longer... To get around this I force materialization, either by putting the data into a data.table or by using vroom's internal function.
1) R packages
data.table and fread - 75 seconds
Mainly included as a baseline.
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed
Vroom implementation - 19 seconds
Note that vroom uses ALTREP; materialization is forced to level the playing field!
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed
Stringfish - 19 seconds
Stringfish uses ALTREP, so reading the data and taking the substrings takes less than a second. The rest is the materialization, similar to vroom.
library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)
dt <- data.table::data.table(
uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed
Stringi - 22 seconds
Note that the conversion to a data.table takes almost no time.
tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]
dt <- data.table::data.table(
uns = stringi::stri_sub(a, 1, 81),
sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc()
#> 22.409 sec elapsed
2) C and Cpp functions
Rcpp with ifstream, reading into the STL first - 22 seconds
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 81));
b.push_back(line.substr(82, 81));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/
Rcpp with ifstream, reading directly into an Rcpp::CharacterVector - 21 seconds
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 81);
b(l) = line.substr(82, 81);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/
Rcpp with a buffer - 75 seconds
This is basically the initial approach I chose, as outlined in the question above.
Not quite sure why it is slower than the others...
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
const int max_buffer_size = 1e8;
const int header_size = 18; // only fixed in this example...
const int n_lines = 1000000;
FILE* infile;
infile = fopen(filename.c_str(), "r");
if (infile == NULL) Rcpp::stop("File Error!\n");
fseek(infile, 0L, SEEK_END);
int64_t file_size = ftell(infile);
fseek(infile, 0L, SEEK_SET);
// initiate the buffers
char* buffer;
int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
? file_size : max_buffer_size;
buffer = (char*) malloc(buffer_size);
// skip the header...
int64_t this_buffer_size = fread(buffer, 1, header_size, infile);
// a holds the first part (quizzes or unsolved) b holds solution/solved
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
const int line_length = 2 * 82; // 2 times 81 digits plus one , or newline
int l = 0;
// fill the buffer
int current_pos = ftell(infile);
int next_buffer_size = file_size - current_pos > buffer_size
? buffer_size : file_size - current_pos;
while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
// read a buffer from current_pos to ftell(infile)
Rcpp::checkUserInterrupt();
int i = 0;
while (i + line_length <= this_buffer_size) {
a[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
b[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;
}
if (i == 0) break;
if (i != this_buffer_size) {
// file pointer reset by i - this_buffer_size (offset to end of buffer)
fseek(infile, i - this_buffer_size, SEEK_CUR);
}
// determine the next buffer size. If the buffer is too large, take only whats
// needed
current_pos = ftell(infile);
next_buffer_size = file_size - current_pos > buffer_size
? buffer_size : file_size - current_pos;
}
free(buffer);
fclose(infile);
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
75.915 sec elapsed
*/
Using R's C API - 125 seconds
No idea why this is not faster; probably because my C code is not efficient... If you have any improvements, I am happy to update the timings.
The mkChar() function creates a CHARSXP, which can be inserted into a character vector (STRSXP). Note that most R strings are stored in a cache (see also 1.10 of R Internals); maybe we could gain some speed if we could bypass the cache - not sure how to do that or whether it is advisable...
Ideally I would like to preallocate 1 million STRSXPs of size 81, memcpy() the values from a C array, and SET_STRING_ELT() them into the vector. No idea how to do that, though.
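A hedged sketch of one small step in that direction (it does not bypass the string cache, it only drops the intermediate copy): mkCharLen() takes an explicit length, so the fixed-size char_array and the two memcpy() calls in the getline() loop of the code further below could be skipped. The 81-character widths and the offset of 82 are assumed from the data layout described above.
/* sketch: the inner loop of the cfunction below, using mkCharLen() so the
   intermediate char_array and memcpy() calls are no longer needed */
while ((read = getline(&line, &len, infile)) != -1) {
    SET_STRING_ELT(uns, l, mkCharLen(line, 81));       /* unsolved: bytes 0..80   */
    SET_STRING_ELT(sol, l, mkCharLen(line + 82, 81));  /* solved:   bytes 82..162 */
    l++;
    if (l == n_lines) break;
}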
See also:
- https://cran.r-project.org/doc/manuals/r-release/R-ints.html
- http://adv-r.had.co.nz/C-interface.html
- https://github.com/hadley/r-internals/
read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
const char * filename = CHAR(asChar(fname));
FILE* infile;
infile = fopen(filename, "r");
if (infile == NULL) error("File cannot be opened");
fseek(infile, 0L, SEEK_END);
int64_t file_size = ftell(infile);
fseek(infile, 0L, SEEK_SET);
const int n_lines = 1000000;
SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
SEXP sol = PROTECT(allocVector(STRSXP, n_lines));
char * line = NULL;
size_t len = 0;
ssize_t read;
int l = 0;
char char_array[82];
char_array[81] = 0;
// skip header
read = getline(&line, &len, infile);
while ((read = getline(&line, &len, infile)) != -1) {
memcpy(char_array, line, 81);
SET_STRING_ELT(uns, l, mkChar(char_array));
memcpy(char_array, line + 82, 81);
SET_STRING_ELT(sol, l, mkChar(char_array));
l++;
if (l == n_lines) break;
}
fclose(infile);
SEXP res = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(res, 0, uns);
SET_VECTOR_ELT(res, 1, sol);
UNPROTECT(3);
return res;
')
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- read_to_list_sexp(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed
Thanks for making a usable snapshot of the data (by the way: no need to tar a single file, you could have just xz-ed the csv file. Anyway.)
I get different results on an Ubuntu 20.04 box, which are closer to what I expected:
- data.table::fread() is competitive, as we expected (I am running data.table from git as there was a regression in a recent release)
- vroom and stringfish are about the same once we force materialization to compare apples to apples rather than to images of apples
- Rcpp is also in the ballpark, but a little more variable
I capped it at 10 runs; variability would likely come down with more runs, but caching also affects it.
In short: no clear winner, and certainly no mandate to replace one of the (known, tuned) reference implementations.
edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
expr min lq mean median uq max neval cld
fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939 10 a
vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921 10 a
sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975 10 a
rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847 10 a
rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851 10 a
edd@rob:~/git/stackoverflow/65043010(master)$
Top-level script code
suppressMessages({
library(data.table)
library(Rcpp)
library(vroom)
library(stringfish)
library(microbenchmark)
})
vroomread <- function(csvfile) {
a <- vroom(csvfile, col_types = "cc", progress = FALSE)
vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
a <- sf_readLines(csvfile)
dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163))
}
sourceCpp("rcppfuncs.cpp")
csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
vroom=vroomread(csvfile),
sfish=sfread(csvfile),
rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
times=10)
Rcpp script code
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 81));
b.push_back(line.substr(82, 81));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 81);
b(l) = line.substr(82, 81);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}