How are global pointer variables stored in memory?

Question

Suppose we have this simple code:

int* q = new int(13);

int main() {
    return 0;
}

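Clearly, the variable q is global and initialized. Based on this answer, we would expect q to be stored in the initialized data segment (.data) of the program file. But q is a pointer, so its value (an address in the heap segment) is only determined at run time. So what value is stored in the data segment of the program file?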

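My attempt:
As I see it, the compiler allocates some space for q in the data segment (typically 8 bytes for a 64-bit address) with no meaningful value. It then places some initialization code in the text segment, ahead of the code for main, which initializes q at run time. Something like this in assembly: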

     ....
     mov  edi, 4
     call operator new(unsigned long)
     mov  DWORD PTR [rax], 13  // rax: 64 bit address (pointer value)

     // offset : q variable offset in data segment, calculated by compiler
     mov  QWORD PTR [ds+offset], rax // store address in data segment
     ....
main:
     ....

Any ideas?

Answer 1

Yes, that is essentially how it works.

Note that in ELF, .data, .bss, and .text are actually sections, not segments. You can look at the assembly yourself by running your compiler:

c++ -S -O2 test.cpp

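You will typically see a main function, plus some kind of initialization code outside that function. The program entry point (part of the C++ runtime) calls the initialization code and then calls main. The initialization code is also responsible for running things like constructors.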
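As a minimal sketch of that ordering, the snippet below gives q a non-constant initializer and records that it ran. The names make_q, g_init_ran, and check_q_ready are hypothetical helpers introduced only for this illustration; the idea is to call check_q_ready() as the first statement of main.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical flag: constant-initialized to false, so it is already valid
// before any dynamic initializer runs.
static bool g_init_ran = false;

// Hypothetical helper standing in for the expression `new int(13)`.
static int* make_q() {
    g_init_ran = true;
    return new int(13);
}

// Non-constant initializer: the startup code runs it before main() begins.
int* q = make_q();

// Call this at the top of main(): by then q should already point at 13.
void check_q_ready() {
    assert(g_init_ran);
    assert(q != nullptr && *q == 13);
    std::printf("q = %p, *q = %d\n", static_cast<void*>(q), *q);
}
```

If both assertions hold inside main, the initializer must have been executed by code that ran earlier, which is the behavior described above.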

Answer 2

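int *q will go in the .bss section, not .data, because it is only initialized at run time by a non-constant initializer (so this is only legal in C++, not in C). There is no need for the executable's data segment to hold 8 bytes for it.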
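To make the distinction concrete, here is a small sketch contrasting the kinds of initializers. The section names in the comments describe what GCC and Clang typically do on ELF targets and are an assumption, not a guarantee; the variable names other than q, and the check_values function, are hypothetical.

```cpp
#include <cassert>

int counter = 13;          // constant initializer: the value 13 is typically stored in .data
int zeroed;                // zero-initialized: typically only reserved space in .bss
int* p_static = &counter;  // address of a global: a link-time constant, so typically
                           // also in an initialized data section (with a relocation)
int* q = new int(13);      // run-time initializer: typically zeroed .bss space plus
                           // a generated init function that stores the pointer

// Hypothetical check, meant to be called from main().
void check_values() {
    assert(counter == 13);
    assert(zeroed == 0);
    assert(p_static == &counter);
    assert(q != nullptr && *q == 13);
}
```

Compiling a file like this with `c++ -S -O2` and looking at which section directive precedes each label is one way to check where a particular toolchain actually puts them.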

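The compiler arranges for the initializer function to run by putting its address into an array of initializers, which the CRT (C Run-Time) startup code calls before calling main.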

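On the Godbolt compiler explorer, you can see the asm for the init function without the noise of directives. Note that the addressing mode is just a simple RIP-relative access to q. The linker fills in the correct offset from RIP at that point, because it is a link-time constant even though the .text and .bss sections end up in different segments.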

Godbolt's directive filtering isn't ideal for us. Some of the directives are relevant, but many of them aren't. Below is a hand-chosen mix of gcc6.2 -O3 asm output with Godbolt's "filter directives" option unchecked, for just the int* q = new int(13); statement. (There is no need to compile a main at the same time, since we are not linking an executable.)

# gcc6.2 -O3 output
_GLOBAL__sub_I_q:      # presumably stands for subroutine
    sub     rsp, 8           # align the stack for calling another function
    mov     edi, 4           # 4 bytes
    call    operator new(unsigned long)   # this is the demangled name, like from objdump -dC
    mov     DWORD PTR [rax], 13
    mov     QWORD PTR q[rip], rax      # clang uses the equivalent `[rip + q]`
    add     rsp, 8
    ret

    .globl  q
    .bss
q:
    .zero   8      # reserve 8 bytes in the BSS

There is no reference to the base address of the ELF data segment (or any other segment).

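There is also definitely no segment-register override. ELF segments have nothing to do with x86 segments. (The default segment register here is DS anyway, so the compiler has no need to emit [ds:rip+q] or anything like it. Some disassemblers may be explicit and show DS even when the instruction has no segment-override prefix, though.)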

This is how the compiler arranges for it to be called before main():

    # the "aw" sets options / flags for this section to tell the linker about it.
    .section        .init_array,"aw"
    .align 8
    .quad   _GLOBAL__sub_I_q       # this assembles to the absolute address of the function.
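The same registration mechanism can be reached by hand. As a sketch, assuming GCC or Clang (the constructor attribute is a compiler extension, not standard C++), the function below is run by the startup code before main; on current ELF toolchains its address typically lands in .init_array just like the .quad entry above. The names g_ctor_calls, before_main, and check_ctor_ran are hypothetical.

```cpp
#include <cassert>

// Constant-initialized counter, valid before any initializer function runs.
static int g_ctor_calls = 0;

// GCC/Clang extension: ask the toolchain to run this function before main().
__attribute__((constructor))
static void before_main() {
    ++g_ctor_calls;
}

// Hypothetical check, meant to be called from main().
void check_ctor_ran() {
    assert(g_ctor_calls == 1);
}
```

Inspecting the output of `c++ -S` for a file like this is a way to confirm which section a given compiler version actually uses for the entry.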

The CRT start code has a loop that knows the size of the .init_array section and uses a memory-indirect call instruction on each function pointer in turn.
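The loop can be modeled in plain C++. This is only a simulation under stated assumptions, not the real CRT code: init_table stands in for the .init_array section, init_q stands in for _GLOBAL__sub_I_q, and run_initializers is a hypothetical name for the startup loop.

```cpp
#include <cassert>
#include <cstddef>

using InitFn = void (*)();

// Stands in for the 8 zeroed bytes reserved for q in .bss.
int* q = nullptr;

// Stands in for the compiler-generated _GLOBAL__sub_I_q.
static void init_q() {
    q = new int(13);
}

// Stands in for .init_array: a table of function pointers.
static const InitFn init_table[] = { init_q };

// Hypothetical model of the CRT loop: call each entry in order and
// return how many initializers were run.
std::size_t run_initializers() {
    std::size_t count = 0;
    for (InitFn fn : init_table) {
        fn();  // indirect call through the pointer stored in the table
        ++count;
    }
    return count;
}
```

Before run_initializers() is called, q is still null, which mirrors the state of the .bss bytes when the program is first loaded; afterwards it points at the heap-allocated 13.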

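The .init_array section is marked writable, so it goes into the data segment. I'm not sure what writes to it. Maybe the CRT code marks entries as done by zeroing out the pointers after calling them?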

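Linux has a similar mechanism for running initializers in dynamic libraries, which is done by the ELF interpreter while it performs dynamic linking. This is why you can call printf() or other glibc stdio functions from _start in a dynamically linked binary built from hand-written asm, and why that fails in a statically linked binary if you don't call the right init functions. (For more on building static or dynamic binaries that define their own _start or just main(), with or without libc, see this Q&A.)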

