How to read large files in segments?

Question

I'm currently testing with a small file and will scale up once it works.

I created a file, bigFile.txt, that contains:

ABCDEFGHIJKLMNOPQRSTUVWXYZ

I ran this to split up the data read from the file:

#include <iostream>
#include <fstream>
#include <memory>
using namespace std;

int main()
{
    ifstream file("bigfile.txt", ios::binary | ios::ate);
    cout << file.tellg() << " Bytes" << '\n';

    ifstream bigFile("bigfile.txt");
    constexpr size_t bufferSize = 4;
    unique_ptr<char[]> buffer(new char[bufferSize]);
    while (bigFile)
    {
        bigFile.read(buffer.get(), bufferSize);
        // print the buffer data
        cout << buffer.get() << endl;
    }
}

This gives me the following result:

26 Bytes
ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZWX

Notice how the characters 'WX' are repeated on the last line, after 'Z'?

How do I get rid of that, so that it stops once it reaches the end?

Answer 1

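cout << buffer.get() uses the const char* overload, which prints a null-terminated C string.

But your buffer isn't null-terminated, and istream::read() can read fewer characters than the buffer size. So when you print buffer, you end up printing old characters that were already there, and the output keeps going until the next null character is encountered, which may lie past the end of the buffer (undefined behavior).

Use istream::gcount() to determine how many characters were read, and print exactly that many characters. For example, using std::string_view: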
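To make the stale-character effect visible without touching memory outside the buffer, here is a minimal sketch. The helper name rawBufferSnapshots and the use of std::istringstream in place of a file are assumptions made for illustration; they are not part of the original code. It records all bytes of the buffer after every read, including the ones the last read did not overwrite.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): reads `data` in chunks of
// `chunkSize` and returns a copy of the WHOLE buffer after each read,
// so bytes left over from a previous read stay visible.
std::vector<std::string> rawBufferSnapshots(const std::string& data, std::size_t chunkSize)
{
    assert(chunkSize > 0);
    std::istringstream in(data);               // stands in for the ifstream
    std::vector<char> buffer(chunkSize, '?');  // '?' marks bytes never written
    std::vector<std::string> snapshots;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(chunkSize));
        if (in.gcount() == 0)                  // nothing was read: end of input
            break;
        // Copy chunkSize bytes regardless of how many were actually read.
        snapshots.emplace_back(buffer.data(), buffer.size());
    }
    return snapshots;
}
```

Called with the 26-letter alphabet and a chunk size of 4, the last snapshot is "YZWX": only 'Y' and 'Z' were read, and 'W' and 'X' are left over from the previous read.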

#include <iostream>
#include <fstream>
#include <memory>
#include <string_view>
using namespace std;

int main()
{
    ifstream file("bigfile.txt", ios::binary | ios::ate);
    cout << file.tellg() << " Bytes" << "\n";
    file.seekg(0, std::ios::beg); // rewind to the beginning

    constexpr size_t bufferSize = 4;
    unique_ptr<char[]> buffer = std::make_unique<char[]>(bufferSize);
    while (file)
    {
        file.read(buffer.get(), bufferSize);
        auto bytesRead = file.gcount();
        if (bytesRead == 0) {
            // EOF
            break;
        }
        // print the buffer data
        cout << std::string_view(buffer.get(), bytesRead) << endl;
    }
}

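Also note that there's no need to open the file a second time: you can rewind the original stream to the beginning and read from it.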
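When scaling up, the same gcount() idea can be wrapped in a reusable function. This is only a sketch under assumptions: forEachChunk and splitIntoChunks are hypothetical names, the callback signature is one possible design, and std::istringstream is used so the example does not depend on a file on disk. It requires C++17 for std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ios>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: calls onChunk once per chunk, passing only the
// bytes that were actually read (the last chunk may be shorter).
void forEachChunk(std::istream& in, std::size_t chunkSize,
                  const std::function<void(std::string_view)>& onChunk)
{
    assert(chunkSize > 0);
    std::vector<char> buffer(chunkSize);
    // A short final read fails the stream but still reports gcount() > 0.
    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()))
           || in.gcount() > 0) {
        onChunk(std::string_view(buffer.data(),
                                 static_cast<std::size_t>(in.gcount())));
    }
}

// Usage example: collect the chunks of an in-memory string.
std::vector<std::string> splitIntoChunks(const std::string& data, std::size_t chunkSize)
{
    std::istringstream in(data);
    std::vector<std::string> chunks;
    forEachChunk(in, chunkSize,
                 [&chunks](std::string_view chunk) { chunks.emplace_back(chunk); });
    return chunks;
}
```

For a real file, an ifstream opened with ios::binary can be passed to forEachChunk in place of the string stream, with a much larger chunk size than 4.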

Answer 2

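The problem is that you don't overwrite the buffer's contents. Here's what your code does:

It reads the beginning of the file
When it reaches 'YZ', it reads that and overwrites only the first two characters of the buffer ('U' and 'V'), because it has reached the end of the file.

An easy fix is to clear the buffer before each read: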

#include <iostream>
#include <fstream>
#include <array>

int main()
{
    std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
    int fileSize = bigFile.tellg();
    std::cout << bigFile.tellg() << " Bytes" << '\n';

    bigFile.seekg(0);
    
    constexpr size_t bufferSize = 4;
    std::array<char, bufferSize> buffer;
    
    while (bigFile)
    {
        for (int i(0); i < bufferSize; ++i)
            buffer[i] = '\0';
        bigFile.read(buffer.data(), bufferSize);
        // Print the buffer data
        std::cout.write(buffer.data(), bufferSize) << '\n';
    }
}

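I also changed:

std::unique_ptr<char[]> to std::array, because we don't need dynamic allocation here and std::array is safer than a C-style array
The print statement to std::cout.write, because the original caused undefined behavior (see @paddy's comment). std::cout << prints a null-terminated string (a sequence of characters ending with a '\0' character), whereas std::cout.write prints a fixed number of characters
The second file open to a call to the std::istream::seekg method (see @rustyx's answer).

Another (and most likely more efficient) way is to read the file character by character, put the characters into the buffer, and print the buffer whenever it is full. After the loop, we print the buffer one last time if it wasn't already printed inside the main for loop.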
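The "fixed number of characters" point has a consequence worth seeing: write emits the padding '\0' characters too. The sketch below assumes a hypothetical helper writeFixedBuffer and writes into a std::ostringstream instead of std::cout, so the result can be inspected.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: copies `text` into a zero-filled 4-byte buffer and
// writes the whole buffer with ostream::write, returning what was written.
std::string writeFixedBuffer(const std::string& text)
{
    constexpr std::size_t bufferSize = 4;
    assert(text.size() <= bufferSize);
    std::array<char, bufferSize> buffer{};    // value-initialized: all '\0'
    text.copy(buffer.data(), text.size());    // copies characters, adds no terminator
    std::ostringstream out;
    // write() does not stop at '\0'; it always emits bufferSize bytes here.
    out.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return out.str();
}
```

writeFixedBuffer("YZ") returns a 4-byte string: 'Y', 'Z', and two '\0' bytes. On a terminal these are usually invisible, but they do end up in the output if it is redirected to a file.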

#include <iostream>
#include <fstream>
#include <array>

int main()
{
    std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
    int fileSize = bigFile.tellg();
    std::cout << bigFile.tellg() << " Bytes" << '\n';

    bigFile.seekg(0);
    
    constexpr size_t bufferSize = 4;
    std::array<char, bufferSize> buffer;
    
    int bufferIndex = -1; // -1 keeps the padding loop below well-defined for an empty file
    for (int i(0); i < fileSize; ++i)
    {
        // Add one character to the buffer
        bufferIndex = i % bufferSize;
        buffer[bufferIndex] = bigFile.get();
        // Print the buffer data
        if (bufferIndex == bufferSize - 1)
            std::cout.write(buffer.data(), bufferSize) << '\n';
    }
    // Override the characters which haven't been already (in this case 'W' and 'X')
    for (++bufferIndex; bufferIndex < bufferSize; ++bufferIndex)
        buffer[bufferIndex] = '\0';
    // Print the buffer for the last time if it hasn't been already
    if (fileSize % bufferSize /* != 0 */)
        std::cout.write(buffer.data(), bufferSize) << '\n';
}
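The character-by-character approach above can also be sketched without knowing the file size in advance, by counting how many buffer slots are filled and writing only those. The names printInGroups and groupsOf are assumptions for illustration, and string streams stand in for the file and for std::cout.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one character at a time, and writes the buffer
// followed by '\n' whenever groupSize characters have been collected.
void printInGroups(std::istream& in, std::ostream& out, std::size_t groupSize)
{
    assert(groupSize > 0);
    std::vector<char> buffer(groupSize);
    std::size_t filled = 0;                   // number of valid characters in buffer
    char c;
    while (in.get(c)) {                       // stops cleanly at end of input
        buffer[filled++] = c;
        if (filled == groupSize) {
            out.write(buffer.data(), static_cast<std::streamsize>(filled)) << '\n';
            filled = 0;
        }
    }
    if (filled > 0)                           // partial last group, no padding needed
        out.write(buffer.data(), static_cast<std::streamsize>(filled)) << '\n';
}

// Usage example on an in-memory string.
std::string groupsOf(const std::string& data, std::size_t groupSize)
{
    std::istringstream in(data);
    std::ostringstream out;
    printInGroups(in, out, groupSize);
    return out.str();
}
```

Because only the filled part of the buffer is written, the last line contains just the remaining characters and no '\0' padding.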

c++

memory

buffer