在 C++ 中加速 reading/processing 个日志文件

Question

我正在尝试读取具有以下格式的大型日志文件（大约 10G）：

3 49
7 28 
...

基本上我们有两列，每一列都是一个正整数。

这是我用来将日志文件读入两级映射的代码：

typedef std::map<int, std::map<int, std::tuple<int, int, int>>> coverage_info;

coverage_info cmap;
...

      fp = fopen("log.txt", "r");

      while (fgets(line, len, fp)) {
        sscanf(line, "%d %d", &func_id, &bb_id);

        auto it = cmap.find(func_id);
        if (it != cmap.end()) {
          auto it1 = (it->second).find(bb_id);
          if (it1 != (it->second).end()) {
            auto c =std::get<0>(cmap[func_id][bb_id]);
            cmap[func_id][bb_id] = std::make_tuple(++c, 0, 0);
          }
          else 
            cmap[func_id][bb_id] = std::make_tuple(1, 0, 0);
        }
        else {
          std::unordered_map<int, std::tuple<int, int, int>> m;
          m[bb_id] = std::make_tuple(1, 0, 0);
          cmap[func_id] = m;
        }
      }

当读取一个 7G 的日志文件时，上面的代码需要大约 30 分钟才能完成，这对我来说太慢了。

我想到了以下优化方式：

使用 unordered_map 而不是 map 来加速 find 从 log(n) 到 O(1)。这将我的案例的处理时间从 30 分钟减少到大约 15 分钟。
也许首先对日志文件进行排序 --> 但是由于使用 unordered_map 对我来说是一个不错的选择，现在对日志文件进行排序没有太大意义？
尝试读取一个缓冲区（例如 10M）而不是每个 fgets 一行？也许我可以使用 std::ifstream 或带有大缓冲区的东西？
使用mmap？但是“平台无关”对我来说是个好主意..
压缩coverage_info地图设计，将第一列和第二列一起使用成一个键而不是两级地图。但同样，这怎么能加快已经 O(1) table 的查询？

还有什么？谁能阐明潜在的优化选项？如有任何建议，我们将不胜感激。

Answer 1

I am trying to read a large log files (around 10G)....

考虑将工作分为两个阶段：

读取文本文件的批处理，然后用它填充一些数据库。你可以使用 sqlite or PostGreSQL as your database, or some NoSQL database (perhaps MongoDB). Of course, you need to learn a bit of SQL if you want to use some RDBMS. You might consider using some indexed file library instead of a database, such as Tokyo Cabinet.
正确的日志处理将发生在数据库本身。

您还可以使用更好的 parsing techniques. Parser generators such as ANTLR, or GNU bison with flex 想到。请注意，sscanf 不是解析某些文本文件的最快方法。

在 Linux 上，不要指望比 wc(1) or cat(1) or cp(1) 程序运行得更快。在我的 Debian/Sid 桌面上，·wc /usr/bin/emacs 需要 0.919 秒，/usr/bin/emacs 有 32 兆字节（在 SSD 磁盘上）。因此，一个 10 GB 的文件将占用 300 倍以上的时间。

当然可以考虑购买更快的硬件，例如一些SSD磁盘。注意 page cache.

Basically we have two columns, each column is an positive integer.

列和行实际上并不存在，除了约定 \n 换行符结束每一行。一个file is a sequence of bytes on most operating systems, sitting in some file system. Lines are just a convention related to using the newline character in files. Textual files are usually UTF-8编码。

为了在 Linux 上进行分析和基准测试，请考虑在编译 C++ 代码时以特定方式使用 time(1), gprof(1), strace(1), perf(1) and see time(7). You may need to invoke GCC。

由于您可以自由选择日志文件的格式，因此可能会以某种二进制格式编写，例如 XDR. Then it might become more compact, so faster to read. You could also compress it (using zlib). Consider also splitting your log file in smaller pieces (see csplit(1)): a hundred files of 100Mbytes each might be processed in parallel (in different threads or processes on a multi-core machine)。

阅读更多内容 about C++ and notice that standard containers may have their allocator. The GC handbook 可以启发您。

看了也不错operating system textbook.

Answer 2

首先你应该检查代码的哪一部分花费的时间最多。可能是 fgets/sscanf 对，或者处理地图。你应该在你在这里问这个问题之前做这个研究。如何？从循环中删除除 sscanf 之外的所有内容，然后查看现在需要多少（出于测试目的，您可能想要读取较小的文件，比如 100M）。或者使用一些分析工具。那么你应该把注意力集中在几乎所有时间的部分。
为什么要读取行然后解析呢？为什么不使用 fscanf？你的循环会这样开始：

同时（真）{ int rc = fscanf(fp, "%d %d", &func_id, &bb_id); 如果 (rc != 2) 休息;
是的，尝试使用 istream 而不是文件。
您有元组的映射，其第二个和第三个元素始终为 0。为什么是元组？您应该将它们替换为整数 (std::map<std::map<int>>)

4a。使用整数后，您可以将整个循环简化为

while (fgets(line, len, fp)) {
    sscanf(line, "%d %d", &func_id, &bb_id);
    ++cmap[func_id][bb_id];
}

或者，使用 fscanf，

while (true) {
    int rc = fscanf(fp, "%d %d", &func_id, &bb_id);
    if (rc != 2)
        break;
    ++cmap[func_id][bb_id];
}

因为std::map::operator[]没有找到就创建元素，并初始化。数字初始化为 0.

您有 张地图，但这里：

   std::unordered_map<int, std::tuple<int, int, int>> m;
   m[bb_id] = std::make_tuple(1, 0, 0);
   cmap[func_id] = m;

你在骂 unordered_map。所以它应该将 unordered_map 转换为 map.

这个东西

     auto c =std::get<0>(cmap[func_id][bb_id]);
     cmap[func_id][bb_id] = std::make_tuple(++c, 0, 0);

似乎效率很低，应该简化为 ++std::get<0>(cmap[func_id][bb_id]) .

你们编译优化了吗？（-O3，如果你使用 Linux 和 g++）。
检查 unordered_map 是否优于地图。检查所有变体：地图中的地图、unordered_map 中的地图、unordered_map 中的地图、unordered_map 中的unordered_map 中的地图。
如果两个数字都是 32 位整数，请考虑从它们中生成一个 unsigned long（z = ((unsigned long) x << 32) + y，其中 x 和 y 是无符号长整型）。如果你知道0<=x<65536, 0<=y<65536，你可能会用x << 16 + y，而x和y是unsigned
数值排序后（sort -d，如果我没记错的话），它可能运行更快，因为它会更好地使用高速缓存。但是排序本身需要时间。

在 C++ 中加速 reading/processing 个日志文件

Accelerate reading/processing log files in C++

c++

optimization

large-files

file-read

large-data