Creating a character DataFrame quickly and efficiently in Rcpp

Question

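I have written a parser that reads character values into std::vector<std::string> vectors; it parses 1 million records in under a second. I then want to convert the vectors into an Rcpp::DataFrame, which takes more than 80 seconds...

Is there a way to create an Rcpp::DataFrame "efficiently" from large character vectors?

When working with numeric values, I would try to std::memcpy() the std::vector into an Rcpp::NumericVector (see this int64 example or this data.table example for more information), but this does not seem to work with character vectors because the elements differ in size.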
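To make the size argument concrete: a std::vector<double> is one contiguous run of fixed-size values, whereas each std::string manages its own character storage, and an R character vector holds one pointer per element to a separately allocated CHARSXP. The following is a minimal standalone sketch in plain C++ (no Rcpp); bulk_copy is a hypothetical helper, and its destination vector merely stands in for the memory behind an Rcpp::NumericVector.

```cpp
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// double is trivially copyable: a std::vector<double> is one contiguous run
// of fixed-size values, so a single memcpy can move it into another buffer.
static_assert(std::is_trivially_copyable<double>::value,
              "double can be copied bytewise");

// std::string owns its characters, so copying the object bytes would copy
// pointers and bookkeeping rather than the text itself.
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string must not be copied with memcpy");

// Hypothetical helper: bulk-copy src into a plain destination buffer that
// stands in for the memory behind a numeric R vector.
std::vector<double> bulk_copy(const std::vector<double>& src) {
  std::vector<double> dest(src.size());
  if (!src.empty()) {
    std::memcpy(dest.data(), src.data(), src.size() * sizeof(double));
  }
  return dest;
}
```

For strings there is no equivalent single copy: every element has to be turned into its own R string, which is where the per-element cost comes from.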

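More information

The basic idea of the function is to parse sudoku string data (each sudoku string is exactly 81 characters long), with two sudokus per line (the data is stored as a .csv file; you can find the data here).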

$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945

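Inside the C++ read function, I fread() the file, fill a buffer, and parse the data into the std::vector<std::string> vectors mentioned above (a and b in this example).

Note that the full code, including the experiments I have run so far, can be found in this gist.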

const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case

FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);

// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..

int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
  int i = 0; // offset into the current buffer
  while (i < this_buffer_size) {
    // buffer from i to i + 81 would look like this:
    // 004300209005009001070060043006002087190007400050083000600000105003508690042910300
    // whereas for b it looks from i to i + 81 like this:
    // 864371259325849761971265843436192587198657432257483916689734125713528694542916378
    a[line] = std::string(buffer + i, buffer + i + 81);
    i += 81 + 1; // skip to the next value, +1 for the , or a newline
    b[line] = std::string(buffer + i, buffer + i + 81);
    i += 81 + 1; // skip to the next value, +1 for the , or a newline
    line++;
  }
  // check next buffer, not shown here...
}

// NEXT: parse the data to an R structure

This takes under 250 milliseconds for the 1 million line dataset.
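The parsing step described above can be sketched in isolation. This is a minimal standalone version that works on an in-memory buffer and assumes every record has exactly the layout <81 digits>,<81 digits>\n; parse_fixed_width is a hypothetical name, and the header skipping and buffer refilling from the question are left out.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split a buffer of fixed-width records of the form
// <81 digits>,<81 digits>\n into two string vectors (quizzes, solutions).
std::pair<std::vector<std::string>, std::vector<std::string>>
parse_fixed_width(const char* buffer, std::size_t size) {
  const std::size_t field = 81;                // digits per sudoku string
  const std::size_t record = 2 * (field + 1);  // two fields plus ',' and '\n'
  const std::size_t n = size / record;         // whole records only

  std::vector<std::string> a, b;
  a.reserve(n);
  b.reserve(n);
  for (std::size_t r = 0; r < n; ++r) {
    const char* p = buffer + r * record;
    a.emplace_back(p, field);              // quiz
    b.emplace_back(p + field + 1, field);  // solution, right after the comma
  }
  return std::make_pair(std::move(a), std::move(b));
}
```

Because the record length is fixed, the offsets can be computed directly and no searching for separators is needed, which is consistent with the parse being the cheap part of the pipeline.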

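I then want to create an Rcpp::DataFrame from the two vectors a and b, and this is where the problem lies: the conversion to an R object takes about 80 seconds.

Given what is known about the data (2 items per line, 81 characters each, 1 million lines, ...), is there a faster alternative?

I do not necessarily have to fill a std::vector first; if possible, I could also collect the data directly into an Rcpp structure.

What I have tried so far

Textbook solution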

Rcpp::DataFrame df = Rcpp::DataFrame::create(
  Rcpp::Named("unsolved") = a,
  Rcpp::Named("solved") = b,
  Rcpp::Named("stringsAsFactors") = false
);

List first


Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

df["unsolved"] = a;
df["solved"] = b;

df.attr("class") = Rcpp::CharacterVector::create("data.frame");

Character matrix

Not really comparable to the other approaches, but it feels more native...

// before the main loop
std::vector<std::string> vec;
// vec holds both data entries, the first (unsolved) at 0 -> n_lines and solved values at n_lines -> n_lines * 2
vec.resize(2 * n_lines);


// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;


// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());
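The layout used for vec relies on R storing matrices column by column: the first n_lines entries form column 1 and the next n_lines entries form column 2. A minimal standalone sketch of that index arithmetic, with at_colmajor as a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: look up element (row, col) in a flat vector that
// stores a matrix with n_rows rows column by column (R's matrix layout).
const std::string& at_colmajor(const std::vector<std::string>& flat,
                               std::size_t n_rows,
                               std::size_t row, std::size_t col) {
  return flat[row + col * n_rows];
}
```

With three rows, a flat vector {"q0", "q1", "q2", "s0", "s1", "s2"} has the quiz of row 1 at index 1 and its solution at index 1 + 3.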

Full Code and Timings on Github

Answer 1

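This is not a proper answer to my question, more a collection of thoughts that I did not want to go to waste, plus some benchmarks. It may be useful to someone facing a similar problem.

Recall that the basic idea is to read 1 million rows of two 81-character strings into an R object (preferably a data.frame, data.table, or tibble). For the benchmarks, I used the 1 million sudoku dataset by Kyubyong Park.

I have split the answer into two parts: 1) using other R packages and 2) working at a lower level with Rcpp/C++ and C.

Surprisingly, for character data, specialized packages such as stringi, stringfish, or vroom are very efficient and beat (my) lower-level C++/C code.

One important thing to note is that some packages use ALTREP (see for example Francoise take on them here), which means that the data is only materialized in R when it is needed. That is, loading the data with vroom takes less than 1 second, but the first operation that needs to materialize the data takes considerably longer... To get around this, I force materialization either by putting the data into a data.table or by using an internal function of vroom.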
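The effect of deferred materialization on timings can be illustrated with a small analogy. The class below is not the ALTREP API and not how vroom or stringfish are implemented; LazyColumn and its members are hypothetical names for a container that keeps the raw text and builds the per-element strings only on first access.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical analogy, not the ALTREP API: keeps the raw file contents and
// creates the per-element strings only when one is first requested.
class LazyColumn {
 public:
  // offset: start of the field inside a record, width: field length,
  // stride: length of one full record including separators.
  LazyColumn(std::string raw, std::size_t offset, std::size_t width,
             std::size_t stride)
      : raw_(std::move(raw)), offset_(offset), width_(width), stride_(stride) {}

  std::size_t size() const { return raw_.size() / stride_; }  // cheap
  bool materialized() const { return materialized_; }

  // The first access pays the full cost of creating every string.
  const std::string& operator[](std::size_t i) {
    if (!materialized_) materialize();
    return values_[i];
  }

 private:
  void materialize() {
    values_.reserve(size());
    for (std::size_t r = 0; r < size(); ++r) {
      values_.emplace_back(raw_, offset_ + r * stride_, width_);
    }
    materialized_ = true;
  }

  std::string raw_;
  std::size_t offset_, width_, stride_;
  std::vector<std::string> values_;
  bool materialized_ = false;
};
```

Timing only the construction of such an object would measure almost nothing, which is why the benchmarks force materialization before stopping the clock.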

1) R packages

data.table and `fread` - 75 seconds

Mainly as a baseline.

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed

Vroom - 19 seconds

Note that vroom uses ALTREP, so materialization is forced here to level the playing field!

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed

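Stringfish - 19 seconds

Stringfish uses ALTREP, so reading the data and taking the substrings takes less than a second. The rest of the time is spent on materialization, similar to vroom.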

library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)

dt <- data.table::data.table(
  uns = sf_substr(a, 1, 81),
  sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed

Stringi - 22 seconds

Note that the conversion to data.table takes almost no time.

tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]

dt <- data.table::data.table(
  uns = stringi::stri_sub(a, 1, 81),
  sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc() 
#> 22.409 sec elapsed

2) C and C++ functions

Rcpp with `ifstream`, reading into STL containers first - 22 seconds

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    // substr takes (position, length), so each field is 81 characters long
    a.push_back(line.substr(0, 81));
    b.push_back(line.substr(82, 81));
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/
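Since std::string::substr takes a start position and a length rather than a start and an end index, the arguments are worth checking against the fixed 163-character line layout (81 digits, a comma, 81 digits). A minimal standalone sketch, with quiz_of and solution_of as hypothetical helper names:

```cpp
#include <string>

// Hypothetical helpers for one data line of the form
// <81 digits>,<81 digits> (163 characters, newline already stripped).
std::string quiz_of(const std::string& line) {
  return line.substr(0, 81);   // position 0, length 81
}

std::string solution_of(const std::string& line) {
  return line.substr(82, 81);  // position 82 (just after the comma), length 81
}
```

For such a line, substr(0, 80) returns only 80 characters and drops the last digit of the quiz, while substr(82, 162) asks for more characters than remain and is clamped to the 81 that are left, so it gives the intended result only by accident.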

Rcpp with `ifstream`, reading directly into `Rcpp::CharacterVector` - 21 seconds

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    // substr takes (position, length), so each field is 81 characters long
    a(l) = line.substr(0, 81);
    b(l) = line.substr(82, 81);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/

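Rcpp with a buffer - 75 seconds

This is basically the initial approach I chose, as described in the question above. I am not quite sure why it is slower than the others...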

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
  const int max_buffer_size = 1e8;
  const int header_size = 18; // only fixed in this example...
  const int n_lines = 1000000;

  FILE* infile;
  infile = fopen(filename.c_str(), "r");
  if (infile == NULL) Rcpp::stop("File Error!\n");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  // initiate the buffers
  char* buffer;
  int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
    ? file_size : max_buffer_size;
  buffer = (char*) malloc(buffer_size);

  // skip the header...
  int64_t this_buffer_size = fread(buffer, 1, header_size, infile);

  // a holds the first part (quizzes or unsolved) b holds solution/solved
  std::vector<std::string> a, b;
  a.resize(n_lines);
  b.resize(n_lines);

  const int line_length = 2 * 82; // 2 times 81 digits plus one , or newline
  int l = 0;
  // fill the buffer
  int current_pos = ftell(infile);
  int next_buffer_size = file_size - current_pos > buffer_size
    ? buffer_size : file_size - current_pos;

  while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
    // read a buffer from current_pos to ftell(infile)
    Rcpp::checkUserInterrupt();
    int i = 0;
    while (i + line_length <= this_buffer_size) {
      a[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      b[l] = std::string(buffer + i, buffer + i + 81);
      i += 82;
      l++;
    }

    if (i == 0) break;
    if (i != this_buffer_size) {
      // file pointer reset by i - this_buffer_size (offset to end of buffer)
      fseek(infile, i - this_buffer_size, SEEK_CUR);
    }
    // determine the next buffer size. If the buffer is too large, take only whats
    // needed
    current_pos = ftell(infile);
    next_buffer_size = file_size - current_pos > buffer_size
      ? buffer_size : file_size - current_pos;
  }

  free(buffer);
  fclose(infile);

  Rcpp::DataFrame df = Rcpp::DataFrame::create(
    Rcpp::Named("unsolved") = a,
    Rcpp::Named("solved") = b,
    Rcpp::Named("stringsAsFactors") = false
  );
  return df;
}

/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
#> 75.915 sec elapsed
*/
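The refill logic of the buffered reader (consume whole records, then seek back over a partial record at the end of the buffer) can be sketched on its own. This is a minimal standalone version under simplifying assumptions: records have a fixed length including the separator, there is no header, and read_records is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: read fixed-length records from `in` through a buffer
// of `buffer_size` bytes, which need not be a multiple of `record_len`.
std::vector<std::string> read_records(std::FILE* in, std::size_t record_len,
                                      std::size_t buffer_size) {
  std::vector<std::string> out;
  std::vector<char> buffer(buffer_size);
  std::size_t got;
  while ((got = std::fread(buffer.data(), 1, buffer.size(), in)) > 0) {
    std::size_t i = 0;
    while (i + record_len <= got) {  // consume whole records only
      out.emplace_back(buffer.data() + i, record_len);
      i += record_len;
    }
    if (i == 0) break;  // nothing fit: buffer too small or trailing fragment
    if (i != got) {
      // Seek back over the partial record so the next fread starts on a
      // record boundary.
      std::fseek(in, -static_cast<long>(got - i), SEEK_CUR);
    }
  }
  return out;
}
```

With 5-byte records and a 12-byte buffer, the first read yields two records and rewinds by 2 bytes, and the second read picks up the third record. Seeking by a computed offset is only reliable on streams opened in binary mode, which is an assumption of this sketch.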

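Using R's C API - 125 seconds

I am not sure why this is not faster, probably because my C code is inefficient... If you have any improvements, I am happy to update the timings.

The mkChar() function creates a CHARSXP, which can be inserted into a character vector (STRSXP). Note that most R character strings are stored in a cache (see also section 1.10 of R Internals); maybe we could gain some speed if we could bypass the cache, though I am not sure how to do that or whether it would be wise...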
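To give a feel for what a string cache implies per element, here is a small analogy. It is not R's implementation; StringPool is a hypothetical interning pool that hands out one shared copy per distinct string, similar in spirit to looking a string up in a global cache before creating it.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical analogy, not R's implementation: an interning pool that keeps
// exactly one copy of every distinct string it has seen.
class StringPool {
 public:
  // Hashes s, looks it up, and inserts a copy only if it is not yet present.
  // Pointers to elements of an unordered_set stay valid across rehashing.
  const std::string* intern(const std::string& s) {
    return &*pool_.insert(s).first;
  }

  std::size_t size() const { return pool_.size(); }

 private:
  std::unordered_set<std::string> pool_;
};
```

With 2 million distinct 81-character strings, every lookup in such a pool would miss, so each string is hashed, compared, and inserted without any sharing benefit. Whether this accounts for the timing above is an open question; the sketch does not measure R itself.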

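Ideally, I would like to preallocate 1 million CHARSXPs of length 81, memcpy() the values from the C array into them, and SET_STRING_ELT() them into the vector. I have no idea how to do that, though.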


read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
  const char * filename = CHAR(asChar(fname));

  FILE* infile;
  infile = fopen(filename, "r");
  if (infile == NULL) error("File cannot be opened");

  fseek(infile, 0L, SEEK_END);
  int64_t file_size = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  const int n_lines = 1000000;

  SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
  SEXP sol = PROTECT(allocVector(STRSXP, n_lines));

  char * line = NULL;
  size_t len = 0;
  ssize_t read;

  int l = 0;

  char char_array[82];
  char_array[81] = 0;
  // skip header
  read = getline(&line, &len, infile);

  while ((read = getline(&line, &len, infile)) != -1) {
    memcpy(char_array, line, 81);
    SET_STRING_ELT(uns, l, mkChar(char_array));

    memcpy(char_array, line + 82, 81);
    SET_STRING_ELT(sol, l, mkChar(char_array));

    l++;
    if (l == n_lines) break;
  }
  fclose(infile);

  SEXP res = PROTECT(allocVector(VECSXP, 2));

  SET_VECTOR_ELT(res, 0, uns);
  SET_VECTOR_ELT(res, 1, sol);

  UNPROTECT(3);
  return res;
')

file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- read_to_list_sexp(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed

Answer 2

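Thanks for making a snapshot of the data available. (By the way, there is no need to tar a single file; you can simply xz the csv file. Anyway.)

I get different results on my Ubuntu 20.04 box, and they are closer to what I would expect:

- data.table::fread() is as competitive as we would expect (I am running data.table from git because there was a regression in the most recent release)
- vroom and stringfish, once we force materialization so that we compare apples to apples rather than images of apples, are about the same
- Rcpp is in the same ballpark too, but a little more variable

I limited it to 10 runs; the variability would probably go down if you ran more, but caching influences it too.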
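The idea of repeating a measurement and summarizing it with robust statistics, as microbenchmark does in R, can be sketched in a few lines. This is a minimal standalone illustration; time_runs and median are hypothetical helper names and not part of any package used here.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty set of timings.
// The argument is taken by value so the caller's vector is left unsorted.
double median(std::vector<double> x) {
  std::sort(x.begin(), x.end());
  const std::size_t n = x.size();
  return n % 2 ? x[n / 2] : 0.5 * (x[n / 2 - 1] + x[n / 2]);
}

// Hypothetical helper: call f() `times` times and return the wall-clock
// duration of each call in seconds.
template <typename F>
std::vector<double> time_runs(F f, int times) {
  std::vector<double> secs;
  secs.reserve(times);
  for (int r = 0; r < times; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    secs.push_back(std::chrono::duration<double>(t1 - t0).count());
  }
  return secs;
}
```

Looking at the median and the spread between minimum and maximum, rather than at a single run, makes it easier to see when two approaches are within noise of each other.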

In short: there is no clear winner, and certainly no mandate to replace one of the (already known to be tuned) reference implementations.

edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
  expr     min      lq    mean  median      uq     max neval cld
 fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939    10   a
 vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921    10   a
 sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975    10   a
 rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847    10   a
 rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851    10   a
edd@rob:~/git/stackoverflow/65043010(master)$

Top-level script code

suppressMessages({
    library(data.table)
    library(Rcpp)
    library(vroom)
    library(stringfish)
    library(microbenchmark)
})

vroomread <- function(csvfile) {
    a <- vroom(csvfile, col_types = "cc", progress = FALSE)
    vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
    a <- sf_readLines(csvfile)
    dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
                                 sol = sf_substr(a, 83, 163))
}

sourceCpp("rcppfuncs.cpp")


csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
               vroom=vroomread(csvfile),
               sfish=sfread(csvfile),
               rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
               rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
               times=10)

Rcpp script code

#include <Rcpp.h>
#include <fstream>

//[[Rcpp::export]]

Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  std::vector<std::string> a, b;
  a.reserve(n_lines);
  b.reserve(n_lines);

  while (std::getline(file, line)) {
    // substr takes (position, length), so each field is 81 characters long
    a.push_back(line.substr(0, 81));
    b.push_back(line.substr(82, 81));
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);

  std::string line;
  // burn the header
  std::getline(file, line);

  Rcpp::CharacterVector a(n_lines), b(n_lines);

  int l = 0;
  while (std::getline(file, line)) {
    // substr takes (position, length), so each field is 81 characters long
    a(l) = line.substr(0, 81);
    b(l) = line.substr(82, 81);
    l++;
  }

  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");

  df["unsolved"] = a;
  df["solved"] = b;

  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");

  return df;
}