Address canonical form and pointer arithmetic

Question

On AMD64-compliant architectures, addresses need to be in canonical form before being dereferenced.

In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros.
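The definition above can be sketched as a small check. This is only an illustration: the function name `is_canonical` and the default of 48 implemented bits are assumptions made for the example, not something taken from the manual.

```cpp
#include <cstdint>

// Hypothetical helper: returns true if `addr` is canonical, assuming
// `implemented_bits` virtual address bits are implemented (48 by default,
// which is an assumption about the target machine).
constexpr bool is_canonical(std::uint64_t addr, unsigned implemented_bits = 48)
{
    // Keep only bit (implemented_bits - 1) and everything above it.
    const std::uint64_t upper = addr >> (implemented_bits - 1);
    // The same number of bits, all set to one.
    const std::uint64_t all_ones = ~std::uint64_t{0} >> (implemented_bits - 1);
    // Canonical means those upper bits are all zeros or all ones.
    return upper == 0 || upper == all_ones;
}

// Compile-time checks for the 48-bit case.
static_assert(is_canonical(0x00007FFFFFFFFFFFull), "top of the low half");
static_assert(is_canonical(0xFFFF800000000000ull), "bottom of the high half");
static_assert(!is_canonical(0x0000800000000000ull), "bit 47 set, bits 48..63 clear");
```

For example, `is_canonical(addr)` could be used as a debug assertion before turning an integer back into a pointer.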

Now, the most-significant implemented bit on current operating systems and architectures is bit 47. This leaves us with a 48-bit address space.

Especially when ASLR is enabled, user programs can expect to receive an address with bit 47 set.

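If optimizations such as pointer tagging are used and the upper bits store information, the program must make sure that bits 48 through 63 are set back to whatever bit 47 is before dereferencing the address.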
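A minimal sketch of that tagging scheme, assuming 48 implemented address bits so that bits 48 through 63 are free for a 16-bit tag. The names `tag_address`, `tag_of`, `untag_address` and `untag_pointer` are hypothetical and exist only for this example.

```cpp
#include <cstdint>

// Assumption: 48 implemented virtual address bits.
constexpr unsigned kAddressBits = 48;
constexpr std::uint64_t kAddressMask = (std::uint64_t{1} << kAddressBits) - 1;

// Store a 16-bit tag in bits 48..63, replacing whatever was there.
constexpr std::uint64_t tag_address(std::uint64_t addr, std::uint16_t tag)
{
    return (addr & kAddressMask) | (std::uint64_t{tag} << kAddressBits);
}

// Read the tag back out of bits 48..63.
constexpr std::uint16_t tag_of(std::uint64_t tagged)
{
    return static_cast<std::uint16_t>(tagged >> kAddressBits);
}

// Drop the tag and copy bit 47 into bits 48..63 (manual sign extension),
// which restores the canonical form of the address.
constexpr std::uint64_t untag_address(std::uint64_t tagged)
{
    const std::uint64_t low = tagged & kAddressMask;
    const std::uint64_t bit47 = (low >> (kAddressBits - 1)) & 1;
    return bit47 ? (low | ~kAddressMask) : low;
}

// Convenience wrapper that turns a tagged value back into a usable pointer.
template <typename T>
T* untag_pointer(std::uint64_t tagged)
{
    return reinterpret_cast<T*>(static_cast<std::uintptr_t>(untag_address(tagged)));
}

// Round trip for a low-half and a high-half address.
static_assert(untag_address(tag_address(0x00007FFFFFFFD850ull, 0xBEEF)) == 0x00007FFFFFFFD850ull, "low half");
static_assert(untag_address(tag_address(0xFFFF800000000000ull, 0x1234)) == 0xFFFF800000000000ull, "high half");
```

Sign-extending from bit 47, rather than simply zeroing the tag bits, is what keeps the sketch correct for high-half addresses as well.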

But consider this code:

int main()
{
    int* intArray = new int[100];

    int* it = intArray;

    // Fill the array with any value.
    for (int i = 0; i < 100; i++)
    {
        *it = 20;
        it++;   
    }

    delete [] intArray;
    return 0;
}

Now suppose that intArray is, say:

0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1100

After setting it to intArray and incrementing it once, and given sizeof(int) == 4, it becomes:

0000 0000 0000 0000 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

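Bit 47 is the leftmost bit of the fifth group. What happens here is that the second pointer produced by pointer arithmetic is invalid because it is not in canonical form. The correct address should be: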

1111 1111 1111 1111 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

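How do programs deal with this? Is there a guarantee by the OS that you will never be allocated memory whose address range does not vary by the 47th bit?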

Answer 1

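The canonical address rules mean there is a giant hole in the 64-bit virtual address space. 2^47-1 is not contiguous with the next valid address above it, so a single mmap can never include any of the unusable range of 64-bit addresses.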

+----------+
| 2^64-1   |   0xffffffffffffffff
| ...      |
| 2^64-2^47|   0xffff800000000000
+----------+
|          |
| unusable |      not to scale: this part is 2^16 times as large
|          |
+----------+
| 2^47-1   |   0x00007fffffffffff
| ...      |
| 0        |   0x0000000000000000
+----------+

Also, most kernels reserve the high half of the canonical range for their own use (see, e.g., x86-64 Linux's memory map). User-space can only allocate in the contiguous low range anyway, so the existence of the gap is irrelevant.
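The boundaries in the diagram, and the scenario from the question, can be checked with a short sketch. The names `CanonicalHalves`, `canonical_halves` and `in_hole` are hypothetical, and the 48-bit default is an assumption about current hardware.

```cpp
#include <cstdint>

// The two edges of the non-canonical hole for a given address width.
struct CanonicalHalves {
    std::uint64_t low_half_last;    // highest address of the low half
    std::uint64_t high_half_first;  // lowest address of the high half
};

// Compute both edges, assuming `implemented_bits` virtual address bits.
constexpr CanonicalHalves canonical_halves(unsigned implemented_bits)
{
    return CanonicalHalves{
        (std::uint64_t{1} << (implemented_bits - 1)) - 1,  // 2^(n-1) - 1
        ~std::uint64_t{0} << (implemented_bits - 1)        // 2^64 - 2^(n-1)
    };
}

// True if `addr` lies strictly between the two canonical halves.
constexpr bool in_hole(std::uint64_t addr, unsigned implemented_bits = 48)
{
    return addr > canonical_halves(implemented_bits).low_half_last &&
           addr < canonical_halves(implemented_bits).high_half_first;
}

// The edges match the diagram for 48 implemented bits.
static_assert(canonical_halves(48).low_half_last == 0x00007FFFFFFFFFFFull, "2^47-1");
static_assert(canonical_halves(48).high_half_first == 0xFFFF800000000000ull, "2^64-2^47");

// The question's hypothetical pointer, incremented by sizeof(int) == 4,
// lands in the hole, so no single mapping could contain both addresses.
static_assert(in_hole(0x00007FFFFFFFFFFCull + 4), "one past the low half");
```

Because an array always lives inside one mapping, and a mapping cannot span the hole, the question's starting address could not be followed by another valid `int` in the first place.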

Is there a guarantee by the OS that you will never be allocated memory whose address range does not vary by the 47th bit?

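Not exactly. The 48-bit address space supported by current hardware is an implementation detail. The canonical-address rules ensure that future systems can support more virtual address bits without breaking backwards compatibility to any significant degree.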

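At most, you'd just need a compat flag to have the OS not give the process any memory regions whose high bits are not all the same (like Linux's current MAP_32BIT flag for mmap, or a process-wide setting). That could support programs that use the high bits for tags and manually redo sign-extension.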

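Future hardware won't need to support any kind of flag to ignore high address bits, because junk in the high bits is currently an error. Intel 5-level paging adds another 9 virtual address bits, widening the canonical high and low halves (see Intel's 5-level paging white paper).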
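To illustrate why a wider address space stays backwards compatible for ordinary programs but not for code that hard-codes 48-bit tagging, here is a rough sketch. It assumes 57 implemented bits under 5-level paging (48 + 9); `canonical_for` and `sign_extend_from_bit47` are hypothetical helpers written for this example.

```cpp
#include <cstdint>

// Canonical check for an arbitrary number of implemented address bits.
constexpr bool canonical_for(std::uint64_t addr, unsigned bits)
{
    const std::uint64_t upper = addr >> (bits - 1);
    return upper == 0 || upper == (~std::uint64_t{0} >> (bits - 1));
}

// What a program written for 48-bit addresses does when it strips a tag:
// it copies bit 47 into bits 48..63.
constexpr std::uint64_t sign_extend_from_bit47(std::uint64_t addr)
{
    const std::uint64_t low = addr & 0x0000FFFFFFFFFFFFull;
    return (low & 0x0000800000000000ull) ? (low | 0xFFFF000000000000ull) : low;
}

// Every address that is canonical with 48 bits is still canonical with 57.
static_assert(canonical_for(0x00007FFFFFFFD850ull, 48) &&
              canonical_for(0x00007FFFFFFFD850ull, 57), "low half preserved");
static_assert(canonical_for(0xFFFF800000000000ull, 48) &&
              canonical_for(0xFFFF800000000000ull, 57), "high half preserved");

// An address in the old hole becomes a valid low-half address with 57 bits...
static_assert(!canonical_for(0x0000800000000000ull, 48) &&
              canonical_for(0x0000800000000000ull, 57), "hole shrinks");

// ...but 48-bit untagging would silently turn it into a different address.
static_assert(sign_extend_from_bit47(0x0000800000000000ull) == 0xFFFF800000000000ull,
              "48-bit sign extension changes the address");
```

The last check is the case a compat flag would guard against: as long as the OS only hands out addresses whose bits 47 through 63 agree, the 48-bit untagging step remains harmless.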

Fun fact: Linux defaults to mapping the stack at the top of the lower range of valid addresses.

$ gdb /bin/ls
...
(gdb) b _start
Function "_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (_start) pending.
(gdb) r
Starting program: /bin/ls

Breakpoint 1, 0x00007ffff7dd9cd0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) p $rsp
$1 = (void *) 0x7fffffffd850
(gdb) exit

$ calc
2^47-1
              0x7fffffffffff

(Modern GDB can use starti to stop before the first user-space instruction executes, instead of messing around with breakpoint commands.)

x86-64

pointer-arithmetic

memory-address

access-violation

aslr