Handling endianness universally rather than instantially

Question

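I am reading a file whose format is consistent across platforms, but which may be big-endian or little-endian depending on the platform the file was built for. That platform is identified by a value within the file.

Currently, I handle endianness with an if statement: one branch reads the file normally, and the other uses byteswap intrinsics: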

// source.h
class File {
public:
    enum class Endian {
        Little = 1,
        Big = 2
    };
};
// ...removed...

// source.cpp
#include "source.h"
#include <fstream>
std::ifstream file;
File::Endian endianness;

// ...removed...

bool GetPlatform() {
    uint32_t platform;
    file.read(reinterpret_cast<char*>(&platform), sizeof(platform));
    if (platform == 1) {
        endianness = File::Endian::Little;
    }
    else if (platform == 2 << 24) {
        endianness = File::Endian::Big;
    }
    // ...removed...
}

void ReadData() {
    uint32_t data;
    uint32_t dataLittle;

    if (endianness == File::Endian::Little) {
        file.read(reinterpret_cast<char*>(&data), sizeof(data));
    }
    else if (endianness == File::Endian::Big) {
        file.read(reinterpret_cast<char*>(&data), sizeof(data));
        dataLittle = _byteswap_ulong(data);
    }
}

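My question is: is it possible to forgo swapping each value when the file is big-endian and instead set the endianness universally? Here is a potential example of what I mean: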

bool GetPlatform() {
    uint32_t platform;
    file.read(reinterpret_cast<char*>(&platform), sizeof(platform));
    if (platform == 1) {
        // Universally set the endianness to little endian
    }
    else if (platform == 2 << 24) {
        // Universally set the endianness to big endian
    }
    // ...removed...
}

void ReadData() {
    uint32_t data;
    file.read(reinterpret_cast<char*>(&data), sizeof(data)); // Data is now read correctly regardless of endianness
}

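The main reason I am asking is that it would essentially halve the amount of code in each function, since the if statement for determining endianness would no longer be needed.

Also, would std::endian be useful for this task? Its examples only show it being used to detect the host's endianness, and I am not sure whether it has any other uses.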

Answer 1

I think the usual answer is to #ifdef the definition of functions like read64:

int64_t read64(char *pos) {
#ifdef IS_BIG_ENDIAN
  ...
#elif IS_LITTLE_ENDIAN
  ...
#else
  // probably # error
#endif
}
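
As a rough sketch of how the elided branches might be filled in, the version below assumes the value is stored big-endian in the file and derives the IS_BIG_ENDIAN / IS_LITTLE_ENDIAN macros from the predefined macros that GCC and Clang provide. The swap64 helper and the const-qualified parameter are illustrative choices, not part of the answer; other compilers would need their own detection or a -D flag from the build system.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: derive the answer's macros from GCC/Clang predefined macros.
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define IS_BIG_ENDIAN 1
#elif defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define IS_LITTLE_ENDIAN 1
#endif

// Portable 64-bit byte reversal (hypothetical helper).
inline std::uint64_t swap64(std::uint64_t v) {
    v = ((v & 0x00000000FFFFFFFFULL) << 32) | ((v & 0xFFFFFFFF00000000ULL) >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v & 0xFFFF0000FFFF0000ULL) >> 16);
    v = ((v & 0x00FF00FF00FF00FFULL) << 8)  | ((v & 0xFF00FF00FF00FF00ULL) >> 8);
    return v;
}

// Reads a 64-bit integer stored in big-endian order at pos (assumed file order).
inline std::int64_t read64(const char* pos) {
    std::uint64_t raw;
    std::memcpy(&raw, pos, sizeof(raw));  // avoids unaligned and aliasing problems
#if defined(IS_BIG_ENDIAN)
    // Host order already matches the stored order; nothing to do.
#elif defined(IS_LITTLE_ENDIAN)
    raw = swap64(raw);
#else
#error "Unknown host byte order"
#endif
    return static_cast<std::int64_t>(raw);
}
```

The branch is chosen by the preprocessor, so callers pay no run-time test. That only works when the file's byte order is fixed at build time.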

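Answer 2

The only way to "automatically" read in the correct endianness is for the CPU's native endianness to match the endianness of the bytes in the file. If they don't match, then something in your code needs to know to do the necessary byte swapping: from the file's endianness to the CPU's endianness when reading from the file, and the reverse when writing to it.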

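Look at how ntohl() and htonl() are implemented to handle network-order (i.e. big-endian) data. On a big-endian platform (such as PowerPC) they are simple no-ops that return their argument verbatim. On a little-endian platform (such as Intel) they return their argument byte-swapped. That way, the code that calls them does not have to do any conditional test at run time to figure out whether byte swapping is appropriate; it just unconditionally runs all of the data it reads through ntohl() or ntohs() and trusts them to do the right thing on every platform. Similarly, when writing data, it unconditionally runs all data values through htonl() or htons() before sending the data out to the file/network/whatever.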
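
A minimal sketch of that calling pattern for a 32-bit value stored in big-endian order follows. The helper names LoadBigEndian32 and StoreBigEndian32 are made up for illustration; ntohl() and htonl() come from <arpa/inet.h> on POSIX systems and from <winsock2.h> on Windows (where the program must also link against ws2_32).

```cpp
#include <cstdint>
#include <cstring>
#ifdef _WIN32
#include <winsock2.h>   // ntohl/htonl on Windows
#else
#include <arpa/inet.h>  // ntohl/htonl on POSIX systems
#endif

// Decodes a 32-bit value stored in big-endian (network) order.
inline std::uint32_t LoadBigEndian32(const unsigned char* src) {
    std::uint32_t raw;
    std::memcpy(&raw, src, sizeof(raw));
    return ntohl(raw);  // no-op on big-endian hosts, byte swap on little-endian hosts
}

// Encodes a 32-bit value into big-endian (network) order.
inline void StoreBigEndian32(std::uint32_t value, unsigned char* dst) {
    const std::uint32_t raw = htonl(value);
    std::memcpy(dst, &raw, sizeof(raw));
}
```

The calling code contains no endianness test of its own; the decision lives inside ntohl() and htonl().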

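Your program can do something similar, either by calling those actual functions or (if you need to read more data types than just 16-bit and/or 32-bit integers) by finding or writing your own functions that are similar in spirit, e.g. something like: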

inline uint32_t NativeToLittleEndianUint32(uint32_t val) {...}
inline uint32_t LittleEndianToNativeUint32(uint32_t val) {...}

inline uint32_t NativeToBigEndianUint32(uint32_t val) {...}
inline uint32_t BigEndianToNativeUint32(uint32_t val) {...}
[...]

inline uint64_t NativeToLittleEndianUint64(uint64_t val) {...}
inline uint64_t NativeToBigEndianUint64(uint64_t val) {...}

inline uint64_t LittleEndianToNativeUint64(uint64_t val) {...}
inline uint64_t BigEndianToNativeUint64(uint64_t val) {...}

[...]

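...and so on. All of the thousands of if/then clauses in your code then go away, replaced by compile-time conditional logic. That makes the code more efficient, easier to test, and less error-prone. If you like templated functions, you can use them to reduce the number of function names that the author of the calling code needs to remember (e.g. you could have template<typename T> inline T NativeToLittleEndian(T val) {...} with template specializations that do the right thing for all of the types you need to support).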
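
One possible shape for the templated variant is sketched below. It is an illustration under stated assumptions, not the answer's actual code: host order is taken from GCC/Clang predefined macros, MSVC targets are assumed to be little-endian, and ByteSwap and kHostIsBigEndian are hypothetical names. A single generic swap stands in for per-type specializations.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Assumption: host byte order from GCC/Clang macros; MSVC targets treated as little-endian.
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
constexpr bool kHostIsBigEndian = true;
#elif (defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)) || defined(_MSC_VER)
constexpr bool kHostIsBigEndian = false;
#else
#error "Unknown host byte order"
#endif

// Reverses the bytes of any unsigned integer type using shifts.
template <typename T>
inline T ByteSwap(T val) {
    static_assert(std::is_unsigned<T>::value, "unsigned integer types only");
    T out = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out = static_cast<T>((out << 8) | (val & 0xFF));  // append lowest byte of val
        val = static_cast<T>(val >> 8);
    }
    return out;
}

// The condition is a compile-time constant, so the untaken side can be optimized away.
template <typename T>
inline T NativeToBigEndian(T val) { return kHostIsBigEndian ? val : ByteSwap(val); }

template <typename T>
inline T NativeToLittleEndian(T val) { return kHostIsBigEndian ? ByteSwap(val) : val; }

// A byte swap is its own inverse, so the reverse conversions reuse the same logic.
template <typename T>
inline T BigEndianToNative(T val) { return NativeToBigEndian(val); }

template <typename T>
inline T LittleEndianToNative(T val) { return NativeToLittleEndian(val); }
```

Calling code then writes, for example, BigEndianToNative(value) for every field of a big-endian file, regardless of the field's width.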

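If you want to go further, you can combine the reading/writing and byte-swapping functions into a single larger function, avoiding two function calls for every data value.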

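Note: be careful when implementing these functions for floating-point types; some CPU architectures (e.g. Intel) implicitly modify unexpected floating-point bit patterns, which means that when endian-swapping e.g. a 32-bit floating-point value, you need to store the non-native/external/byte-swapped representation of that value as a uint32_t rather than as a "float". If you want to see an example of how I deal with this in my code, look at e.g. the definitions of the B_HOST_TO_BENDIAN_IFLOAT and B_BENDIAN_TO_HOST_IFLOAT macros in this file.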
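
The idea can be sketched as follows. This is not the code behind the macros named above; the function names are invented for illustration, and the sketch assumes a 32-bit float. The swapped bit pattern only ever lives in a uint32_t, and std::memcpy moves bits between the two types without any numeric conversion.

```cpp
#include <cstdint>
#include <cstring>

// Reverses the four bytes of a 32-bit value.
inline std::uint32_t ByteSwap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Converts a native float to its byte-swapped representation, held as an integer.
inline std::uint32_t FloatToSwappedBits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));  // copy the bit pattern as-is
    return ByteSwap32(bits);
}

// Converts a byte-swapped representation back into a native float.
inline float SwappedBitsToFloat(std::uint32_t swapped) {
    const std::uint32_t bits = ByteSwap32(swapped);  // swap while still an integer
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

The swapped value is never loaded into a float variable, so the CPU has no chance to alter a bit pattern it considers unusual.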

Answer 3

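If I understand your situation, your underlying problem is a missing level of abstraction. You have a bunch of functions that read various data structures from your file. Since these functions call std::ifstream::read directly, they all need to know both the structure they are reading and the layout of the file. That is two tasks, one more than ideal. You would be better off splitting this logic across two levels of abstraction. Let's call the functions of the new level ReadBytes, since they focus on getting bytes from the file. Since Microsoft provides three byteswap intrinsics, there will be three such functions. Here is a first attempt at the one for 4-byte values.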

void ReadBytes(std::ifstream & file, File::Endian endianness, uint32_t & data) {
    file.read(reinterpret_cast<char*>(&data), sizeof(data));
    if (endianness == File::Endian::Big) {
        data = _byteswap_ulong(data);
    }
}

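Note that I return the data through a parameter. This allows all three functions to have the same name; the type of this parameter tells the compiler which overload to use. (There are other approaches. Coding styles differ.)

There are other improvements to be made, but this is enough to create the new level of abstraction. The various functions that read data from the file would change to look like the following.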

void ReadData() {
    uint32_t data;

    ReadBytes(file, endianness, data);
    // More processing here, maybe more reads.
}

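With this small sample of code, the savings are not obvious. However, you indicated that there may be many functions filling the role of ReadData. This approach shifts the burden of correcting endianness from those functions to the new ReadBytes functions. The number of if statements drops from "hundreds, if not thousands" to three.

This change is motivated by the programming principle commonly called "don't repeat yourself". The same principle can prompt questions such as "why is there more than one function that needs this code?"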

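Another issue complicating things for you is that you seem to have taken a procedural approach to the problem rather than an object-oriented one. Symptoms of a procedural approach can include an excess of function parameters (e.g. endianness as a parameter) and global variables. The interface would be easier to use if it were wrapped in a class. Here is a start at declaring such a class (i.e. the start of a header file). Note that the endianness is private and that nothing in this header indicates how the endianness is determined. With good encapsulation, code outside the class will not care which platform created the file.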

// Designed as a drop-in replacement for an ifstream.
// (Non-public inheritance *might* be appropriate if you want to restrict the interface.)
class IFile : public std::ifstream {
private:
    File::Endian endianness;

public:
    // Mimic the constructors of std::ifstream that you need.
    explicit IFile(const std::string & filename);

    // It should be possible to use some template magic to simplify the
    // definition of these three functions, but since there are only three:
    // IFile is itself the stream, so call the inherited read() directly.
    void ReadBytes(uint16_t & data) {
        read(reinterpret_cast<char*>(&data), sizeof(data));
        if (endianness == File::Endian::Big) {
            data = _byteswap_ushort(data);
        }
    }
    void ReadBytes(uint32_t & data) {
        read(reinterpret_cast<char*>(&data), sizeof(data));
        if (endianness == File::Endian::Big) {
            data = _byteswap_ulong(data);
        }
    }
    void ReadBytes(uint64_t & data) {
        read(reinterpret_cast<char*>(&data), sizeof(data));
        if (endianness == File::Endian::Big) {
            data = _byteswap_uint64(data);
        }
    }
};

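This is just a start. For one thing, the interface needs more work. In addition, the ReadBytes functions could be written more portably, perhaps using std::endian rather than assuming a little-endian host. (Boost has an endian library that can help you write truly portable code. It even uses intrinsics by default where available.)

The determination of endianness is done in the implementation (source) file. It seems this should be done as part of opening the file. I made it part of the constructor for this example, but you may need more flexibility (use the ifstream interface as a guide). In any case, the logic for detecting the platform does not need to be accessible outside the implementation of this class. Here is a start at the implementation.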

// Helper function, not needed outside this class.
// This should be either static or put into an anonymous namespace.
static File::Endian ReadEndian(std::ifstream & file) {
    uint32_t platform;
    file.read(reinterpret_cast<char*>(&platform), sizeof(platform));
    if (platform == 1) {
        return File::Endian::Little;
    }
    else if (platform == 2 << 24) {
        return File::Endian::Big;
    }
    // Handle unrecognized platform here
}

// The base-class stream is constructed first, so it can be passed to the helper.
IFile::IFile(const std::string & filename) : std::ifstream(filename),
    endianness(ReadEndian(*this))
{}

At this point, your various ReadData functions could look like the following (no global variables used).

void ReadData(IFile & file) {
    uint32_t data;

    file.ReadBytes(data);
}

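This is even simpler than what you were looking for, since there is less repeated code. (The cast to char* and the taking of the size no longer need to be repeated everywhere.)

In summary, there are two main areas for improvement.

Don't repeat yourself. Code that is repeated often should be moved into a separate function.
Object-oriented. Rely on objects to handle routine tasks rather than delegating them to whoever uses the objects.

Both of these make it easier to safely make sweeping changes, such as supporting a new endianness. There is no pre-built switch for setting the endianness, but when your code is better organized, building one is not hard.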


c++

endianness