Read a big file by lines in C++

Question

I have a big file, nearly 800 MB, and I want to read it line by line.

At first my program was written in Python, and I used linecache.getlines:

lines = linecache.getlines(fname)

This takes about 1.2 s.

Now I want to port my program to C++.

I wrote this code:

    std::ifstream DATA(fname);
    std::string line;
    std::vector<std::string> lines;

    while (std::getline(DATA, line)){
        lines.push_back(line);
    }

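But it is slow (it takes minutes). How can I improve it?

Joachim Pileborg mentioned mmap(); on Windows, CreateFileMapping() does the same job.

My code runs under VS2013. In "DEBUG" mode it takes 162 seconds;

in "RELEASE" mode it takes only 7 seconds!

(Many thanks to @DietmarKühl and @Andrew.)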

Answer 1

For C++, you could try something like this:

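#include <boost/algorithm/string.hpp>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

using namespace std;

// Defined elsewhere by the caller; receives the tokens of one chunk.
void do_some_operation(const vector<string>& arr);

void processData(const string& str)
{
  vector<string> arr;
  boost::split(arr, str, boost::is_any_of(" \n"));
  do_some_operation(arr);
}

int main()
{
 const size_t read_bytes = 45 * 1024 * 1024;  // read 45 MB per chunk
 const char* fname = "input.txt";
 ifstream fin(fname, ios::in);
 vector<char> memblock(read_bytes);  // allocate the buffer once, outside the loop

 // The last read() is usually short and sets failbit, so also check gcount().
 // Note: a chunk boundary can fall in the middle of a line or token.
 while (fin.read(memblock.data(), memblock.size()) || fin.gcount() > 0)
 {
    // The buffer is not null-terminated, so pass the byte count explicitly.
    string str(memblock.data(), static_cast<size_t>(fin.gcount()));
    processData(str);
 }
 return 0;
}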

Answer 2

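First of all, you should probably make sure that you are compiling with optimizations enabled. For such a simple algorithm this might not matter, but it really depends on your vector/string library implementation.

As @angew suggested, std::ios_base::sync_with_stdio(false) makes a big difference for the routine you have written.

Another, lesser, optimization is to use lines.reserve() to preallocate your vector so that push_back() doesn't result in huge copy operations. However, this is only really useful if you happen to know in advance roughly how many lines you are likely to receive.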

Using the optimizations suggested above, I get the following results for reading an 800 MB text stream:

 20 seconds ## if average line length = 10 characters
  3 seconds ## if average line length = 100 characters
  1 second  ## if average line length = 1000 characters

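As you can see, the speed is dominated by per-line overhead. This overhead occurs primarily inside the std::string class.

Any approach based on storing a large number of std::string objects is likely to be suboptimal in terms of memory allocation overhead. On a 64-bit system, std::string requires a minimum of 16 bytes of overhead per string. In fact, the overhead may well be significantly greater than that, and you could find that memory allocation (inside std::string) becomes a significant bottleneck.

For optimal memory use and performance, consider writing your own routine that reads the file in large blocks rather than using getline(). Then you could apply something similar to the flyweight pattern to manage the indexing of the individual lines using a custom string class.

P.S. Another relevant factor is the physical disk I/O, which may or may not be bypassed by caching.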
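One possible reading of this advice, sketched under assumptions: read the whole file into a single buffer and keep only an (offset, length) pair per line instead of one std::string per line. The names LineRef, LineIndex, index_lines and load_file are invented for this example, and the file is assumed to fit in memory.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Lightweight handle to one line: an offset and a length into a shared buffer.
struct LineRef {
    std::size_t offset;
    std::size_t length;
};

// Hypothetical container: owns the file contents once and indexes the lines.
struct LineIndex {
    std::string buffer;
    std::vector<LineRef> lines;

    // Materialises line i as a std::string only when it is actually needed.
    std::string line(std::size_t i) const {
        return buffer.substr(lines[i].offset, lines[i].length);
    }
};

// Scans the text once and records where each '\n'-separated line starts and ends.
LineIndex index_lines(std::string text)
{
    LineIndex idx;
    idx.buffer = std::move(text);
    std::size_t start = 0;
    while (start < idx.buffer.size()) {
        std::size_t end = idx.buffer.find('\n', start);
        if (end == std::string::npos) end = idx.buffer.size();  // last line has no newline
        idx.lines.push_back({start, end - start});
        start = end + 1;
    }
    return idx;
}

// Reads the whole file with a single read() call, then indexes it.
LineIndex load_file(const char* fname)
{
    std::ifstream fin(fname, std::ios::in | std::ios::binary);
    std::string text;
    fin.seekg(0, std::ios::end);
    const std::streamoff size = fin.tellg();  // -1 if the file could not be opened
    if (size > 0) {
        text.resize(static_cast<std::size_t>(size));
        fin.seekg(0, std::ios::beg);
        fin.read(&text[0], size);
        text.resize(static_cast<std::size_t>(fin.gcount()));
    }
    return index_lines(std::move(text));
}
```

With this layout there is one large allocation for the text plus one growing vector for the index, rather than one allocation per line. Because the file is opened in binary mode, a file with Windows line endings keeps a trailing '\r' on each line.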
