Dealing with undefined behavior when using reinterpret_cast in a memory mapping

Question

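To avoid copying large amounts of data, it is desirable to mmap a binary file and process the raw data directly. This approach has several advantages, including delegating paging to the operating system. Unfortunately, it is my understanding that the obvious implementation leads to undefined behavior (UB).

My use case is the following: create a binary file that contains a header identifying the format and providing metadata (in this case simply the number of double values). The remainder of the file contains raw binary values which I wish to process without first copying the file into a local buffer (that is why I am memory-mapping the file in the first place). The program below is a full (if simple) example (I believe all places marked UB[X] lead to UB):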

// C++ Standard Library
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <numeric>
#include <stdexcept>

// POSIX Library (for mmap)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr char MAGIC[8] = {"1234567"};

struct Header {
  char          magic[sizeof(MAGIC)] = {'\0'};
  std::uint64_t size                 = {0};
};
static_assert(sizeof(Header) == 16, "Header size should be 16 bytes");
static_assert(alignof(Header) == 8, "Header alignment should be 8 bytes");

void write_binary_data(const char* filename) {
  Header header;
  std::copy_n(MAGIC, sizeof(MAGIC), header.magic);
  header.size = 100u;

  std::ofstream fp(filename, std::ios::out | std::ios::binary);
  fp.write(reinterpret_cast<const char*>(&header), sizeof(Header));
  for (auto k = 0u; k < header.size; ++k) {
    double value = static_cast<double>(k);
    fp.write(reinterpret_cast<const char*>(&value), sizeof(double));
  }
}

double read_binary_data(const char* filename) {
  // POSIX mmap API
  auto        fp = ::open(filename, O_RDONLY);
  struct stat sb;
  ::fstat(fp, &sb);
  auto data = static_cast<char*>(
      ::mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fp, 0));
  ::close(fp);
  // end of POSIX mmap API (all error handling omitted)

  // UB1
  const auto header = reinterpret_cast<const Header*>(data);

  // UB2
  if (!std::equal(MAGIC, MAGIC + sizeof(MAGIC), header->magic)) {
    throw std::runtime_error("Magic word mismatch");
  }

  // UB3
  auto beg = reinterpret_cast<const double*>(data + sizeof(Header));

  // UB4
  auto end = std::next(beg, header->size);

  // UB5
  auto sum = std::accumulate(beg, end, double{0});

  ::munmap(data, sb.st_size);

  return sum;
}

int main() {
  const double expected = 4950.0;
  write_binary_data("test-data.bin");

  if (auto sum = read_binary_data("test-data.bin"); sum == expected) {
    std::cout << "as expected, sum is: " << sum << "\n";
  } else {
    std::cout << "error\n";
  }
}

Compile and run as:

$ clang++ example.cpp -std=c++17 -Wall -Wextra -O3 -march=native
$ ./a.out
as expected, sum is: 4950

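In real life, the actual binary format is a lot more complex but retains the same properties: fundamental types are stored in the binary file with proper alignment.

My question is: how do you deal with this use case?

I have found many answers that I consider conflicting.

Some answers unambiguously state that one should construct objects locally. This may very well be the case, but it seriously complicates any array-oriented operations.

Comments elsewhere seem to agree on the UB nature of this construct, although there is some disagreement.

The wording in cppreference is, at least to me, confusing. I would have interpreted it as "what I'm doing is perfectly legal". Specifically this paragraph: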

Whenever an attempt is made to read or modify the stored value of an object of type DynamicType through a glvalue of type AliasedType, the behavior is undefined unless one of the following is true:

AliasedType and DynamicType are similar.

AliasedType is the (possibly cv-qualified) signed or unsigned variant of DynamicType.

AliasedType is std::byte (since C++17), char, or unsigned char: this permits examination of the object representation of any object as an array of bytes.

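It may be that C++17 offers some hope with std::launder, or that I'll have to wait until C++20 for something along the lines of std::bit_cast.

In the meantime, how do you deal with this issue?

Link to on-line demo: https://onlinegdb.com/rk_xnlRUV

Simplified example in C

Is my understanding correct that the following C program does not exhibit undefined behavior? I understand that pointer casting through a char buffer does not participate in the strict aliasing rules.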

#include <stdint.h>
#include <stdio.h>

struct Header {
  char     magic[8];
  uint64_t size;
};

static void process(const char* buffer) {
  const struct Header* h = (const struct Header*)(buffer);
  printf("reading %llu values from buffer\n", (unsigned long long)h->size);
}

int main(int argc, char* argv[]) {
  if (argc != 2) {
    return 1;
  }
  // In practice, I'd pass the buffer through mmap
  FILE* fp = fopen(argv[1], "rb");
  char  buffer[sizeof(struct Header)];
  fread(buffer, sizeof(struct Header), 1, fp);
  fclose(fp);
  process(buffer);
}

I can compile and run this C code by passing it the file created by the original C++ program, and it works as expected:

$ clang struct.c -std=c11 -Wall -Wextra -O3 -march=native
$ ./a.out test-data.bin 
reading 100 values from buffer

Answer 1

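std::launder solves the problem with strict aliasing, but not that of object lifetime.

std::bit_cast makes a copy (it is basically a wrapper for std::memcpy) and does not work for copying from a range of bytes.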

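There is no tool in standard C++ to reinterpret mapped memory without copying. Such a tool has been proposed: std::bless. Until or unless such changes are adopted into the standard, you will have to either hope that the UB does not break anything^†, take the potential^†† performance hit and copy the memory, or write the program in C.

^† While not ideal, this is not necessarily as bad as it sounds. You are already restricting portability by using mmap, and if your target system/compiler promises that mmapped memory may be reinterpreted (perhaps with laundering), then there should be no problem. That said, I don't know whether, say, GCC on Linux gives such a guarantee.

^†† The compiler might optimize the std::memcpy away, so there might not be any performance hit. There is a handy function in this SO answer which was observed to be optimized away, but which does start the object lifetime according to the language rules. It does have a limitation: the mapped memory must be writable (because it creates objects in the memory, and in a non-optimized build it might perform an actual copy).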
