Missing bytes when reading from file

Question

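I'm writing some code to merge two .txt files that contain test data captured for the same device but on different occasions. The data is stored in .csv format.

EDIT: (That is, the files are formatted like csv files, but saved as .txt, encoded as UTF-8 with BOM.)

Before worrying about the merging part, and given my relative inexperience with C++, I was working through some problems with reading the files when I noticed a mismatch of several thousand bytes between the file size reported by several methods and the number of bytes that can actually be read in before EOF is reached. Does anyone know what might be causing this?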

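Methods used to check the file size before reading it in:

Constructing a std::filesystem::directory_entry object for the file in question, then calling its .file_size() method. This returns 733435 bytes.
Constructing an fstream object for the file, followed by this code: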

#include <iostream>
#include <fstream>

int main() {
    std::fstream data_file(path_to_file, std::ios::in);
    int file_size = 0; // EDIT: was in the wrong scope

    if (data_file.is_open()) {
        

        data_file.seekg(0, std::ios_base::end);
        file_size = data_file.tellg();
        data_file.seekg(0, std::ios_base::beg);
    }
    
    std::cout << file_size << std::endl; // --> 733435 bytes

}

Checking the file's properties in File Explorer. File size = 733435 bytes, size on disk = 737280 bytes.

Then, when I read the file in as follows:

#include <iostream>
#include <fstream>

int main() {
    std::fstream data_file(path_to_file, std::ios::in);
    
    if (data_file.is_open()) {
        int file_size, chars_read;

        data_file.seekg(0, std::ios_base::end);
        file_size = data_file.tellg();
        data_file.seekg(0, std::ios_base::beg);

        std::cout << "File size: " << file_size << std::endl;
        // |--> "File size: 733435"
    
        char* buffer = new char[file_size];

        // This sets both the eofbit & failbit flags for the stream
        // As is expected if the stream runs out of characters to read in
        // Before n characters are read in. (istream::read(char* s, streamsize n))
        data_file.read(buffer, file_size);

        // We can check the number of chars read in using istream::gcount()
        chars_read = data_file.gcount();

        std::cout << "Chars read: " << chars_read << std::endl;
        // |--> "Chars read: 716153"

        delete[] buffer;
        data_file.close();
    }

}

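The mystery deepens a little when you look at what was actually read in. The file was read in using two slightly different methods.

Reading the data line by line, straight from the file stream, into a std::vector<std::string>.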

std::fstream stream(path_to_file, std::ios::in);
std::vector<std::string> v;
std::string s;

while (getline(stream, s, '\n')) {
    v.push_back(s);
}

Reading the data with fstream::read(...) as above, then splitting it into lines with a stringstream object.

//... data read into char* buffer;
std::stringstream ss(buffer, std::ios::in);
std::vector<std::string> v2;
while (getline(ss, s, '\n')) {
    v2.push_back(s);
}

As far as I can tell, these should have identical contents. But...

std::cout << v.size() << std::endl;  //  --> 17283
std::cout << v2.size() << std::endl; // --> 17688

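EDIT: The file itself has 17283 lines, the last of which is empty.

In summary, a mismatch of slightly over 17000 bytes between the expected and measured file sizes, plus a mismatch in the number of lines produced by two different processing methods, means I have no idea what is going on.

Any suggestions would help, including further ways to test what is happening.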

Answer 1

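fstream opens files in "text" mode by default. On many platforms this makes no difference, but on Windows in particular, text mode performs character translation automatically: a \r\n on disk is read in simply as \n.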
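
One way to test this explanation is to count the \r\n pairs in the raw bytes. The figures in the question are at least consistent with it: 733435 - 716153 = 17282, about one dropped \r per line in a file of 17283 lines. The following is a minimal sketch, assuming the file fits comfortably in memory; read_raw and count_crlf are hypothetical helper names, not something from the question.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's bytes exactly as stored on disk.
// std::ios::binary disables the text-mode newline translation.
// Returns an empty string if the file cannot be opened.
std::string read_raw(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: counts the "\r\n" pairs in a byte string.
std::size_t count_crlf(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t pos = bytes.find("\r\n"); pos != std::string::npos;
         pos = bytes.find("\r\n", pos + 2)) {
        ++count;
    }
    return count;
}
```

If count_crlf(read_raw(path_to_file)) comes out equal to the gap between tellg() and gcount(), the missing bytes are most likely the carriage returns that text mode stripped.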

For more discussion, see "Difference between opening a file in binary vs text". One of the answers there discusses what is permitted when using seek() and tell().

A simple thing to try is to open the file in binary mode: OR the std::ios::binary flag with your std::ios::in flag.
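
As a rough sketch of that suggestion, the function below opens the file in binary mode, reads it in one call, and splits it into lines. It also addresses a second issue that may explain the extra lines in v2: in the question, a raw char* with no terminating '\0' is passed to std::stringstream, so the string is built by scanning until a zero byte happens to appear, which can run past the characters actually read. Here the buffer is a std::string trimmed to gcount(). The name read_lines_binary is hypothetical, and stripping the trailing '\r' is an assumption about how the lines should be presented.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads a file in binary mode and splits it into lines.
std::vector<std::string> read_lines_binary(const std::string& path) {
    std::vector<std::string> lines;

    // Binary mode: the bytes read match what seekg/tellg report.
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return lines;

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (size <= 0) return lines;

    // A std::string knows its own length, so no terminator is needed.
    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(size));
    buffer.resize(static_cast<std::size_t>(in.gcount()));

    std::istringstream ss(buffer);
    std::string line;
    while (std::getline(ss, line, '\n')) {
        // In binary mode the '\r' of a "\r\n" ending is kept; drop it here.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

With this approach gcount() should equal the size reported by tellg(), and the line count no longer depends on whatever follows the data in memory.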


c++

c++17