Why does a fully static Rust ELF binary have a Global Offset Table (GOT) section?

Question

This code produces a .got section when compiled for the x86_64-unknown-linux-musl target:

fn main() {
    println!("Hello, world!");
}

$ cargo build --release --target x86_64-unknown-linux-musl
$ readelf -S hello
There are 30 section headers, starting at offset 0x26dc08:

Section Headers:
[Nr] Name              Type             Address           Offset
   Size              EntSize          Flags  Link  Info  Align
...
[12] .got              PROGBITS         0000000000637b58  00037b58
   00000000000004a8  0000000000000008  WA       0     0     8
...

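According to an answer about similar C code, the .got section is an artifact that can be safely removed. However, removing it makes my binary segfault: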

$ objcopy -R.got hello hello_no_got
$ ./hello_no_got
[1]    3131 segmentation fault (core dumped)  ./hello_no_got

Looking at the disassembly, the GOT basically holds static function addresses:

$ objdump -d hello -M intel
...
0000000000400340 <_ZN5hello4main17h5d434a6e08b2e3b8E>:
...
  40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
...

$ objdump -s -j .got hello | grep 637da8
637da8 50434000 00000000 b0854000 00000000  PC@.......@.....

$ objdump -d hello -M intel | grep 404350
0000000000404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>:
  404350:       41 57                   push   r15

The number 404350 comes from 50434000 00000000, which is the little-endian encoding of 0x0000000000404350 (this was not obvious; I had to run the binary under GDB to figure it out!).
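To make the byte-order step concrete, here is a minimal Python sketch (not part of the original investigation) that decodes the hex groups printed by objdump -s as one little-endian 64-bit value. The helper name decode_got_entry is made up for illustration.

```python
def decode_got_entry(hex_words):
    """Decode objdump -s hex groups (shown in file byte order) as a little-endian u64."""
    raw = bytes.fromhex("".join(hex_words))
    if len(raw) != 8:
        raise ValueError("a GOT entry on x86_64 is 8 bytes")
    return int.from_bytes(raw, "little")


# First entry on the objdump -s line for address 637da8.
print(hex(decode_got_entry(["50434000", "00000000"])))  # 0x404350
```

The same call on the second pair of groups on that line, b0854000 00000000, gives the next GOT entry.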

This is puzzling, because Wikipedia says

[GOT] is used by executed programs to find during runtime addresses of global variables, unknown in compile time. The global offset table is updated in process bootstrap by the dynamic linker.

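Why is the GOT present at all? Judging by the disassembly, the compiler seems to know all the addresses it needs. As far as I can tell, no bootstrap is performed by a dynamic linker: my binary has neither an INTERP nor a DYNAMIC program header;
Why does the GOT store function pointers? Wikipedia says the GOT is only for global variables, and that function pointers belong in the PLT.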
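The claim about missing INTERP and DYNAMIC headers can be checked without readelf. The following is a rough sketch, assuming a little-endian ELF64 file; the function names are invented for this example, while the header offsets and the PT_* constants come from the ELF format.

```python
import struct

PT_LOAD, PT_DYNAMIC, PT_INTERP = 1, 2, 3


def program_header_types(elf_bytes):
    """Return the p_type field of every program header in a little-endian ELF64 image."""
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 0x36)
    return [
        struct.unpack_from("<I", elf_bytes, e_phoff + i * e_phentsize)[0]
        for i in range(e_phnum)
    ]


def uses_dynamic_linker(elf_bytes):
    """True if the image carries an INTERP or DYNAMIC program header."""
    types = program_header_types(elf_bytes)
    return PT_INTERP in types or PT_DYNAMIC in types
```

Reading the binary with open("hello", "rb").read() and passing the bytes to uses_dynamic_linker should return False for the executable described here, if the observation above holds.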

Answer 1

TL;DR summary: the GOT really is a rudimentary build artifact, and I was able to get rid of it with simple machine-code manipulations.

Details

If we look at

$ objdump -dj .text hello

and search for GLOBAL, we see only four different kinds of GOT references (only the constants differ):

  40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
  425903:       ff 25 5f 26 21 00       jmp    QWORD PTR [rip+0x21265f]        # 637f68 <_GLOBAL_OFFSET_TABLE_+0x410>
  41d8b5:       48 3b 1d b4 a5 21 00    cmp    rbx,QWORD PTR [rip+0x21a5b4]    # 637e70 <_GLOBAL_OFFSET_TABLE_+0x318>
  40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
  40b260:       00

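All of these are read instructions, which means the GOT is not modified at runtime. That in turn means we can resolve the addresses the GOT refers to statically! Let's consider the reference types one by one: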
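Before rewriting anything, it helps to see how objdump arrives at the GOT slot address in each comment. A small sketch (the function name is an assumption of this example): in a RIP-relative operand, rip is the address of the next instruction, so the slot is the instruction address plus its length plus the displacement.

```python
def rip_relative_target(insn_address, insn_length, displacement):
    """Address referenced by [rip+disp]; rip points at the NEXT instruction."""
    return insn_address + insn_length + displacement


# (address, encoded length in bytes, displacement) for the four references listed above.
references = [
    (0x40037C, 6, 0x237A26),  # call QWORD PTR [rip+0x237a26]
    (0x425903, 6, 0x21265F),  # jmp  QWORD PTR [rip+0x21265f]
    (0x41D8B5, 7, 0x21A5B4),  # cmp  rbx,QWORD PTR [rip+0x21a5b4]
    (0x40B259, 8, 0x22CB7F),  # cmp  QWORD PTR [rip+0x22cb7f],0x0
]
for address, length, disp in references:
    print(hex(rip_relative_target(address, length, disp)))
```

The four printed values match the addresses in the objdump comments: 0x637da8, 0x637f68, 0x637e70 and 0x637de0.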

call QWORD PTR [rip+0x237a26] simply says "go to address [rip+0x237a26], take 8 bytes from there, interpret them as a function address and call the function". We can simply replace this instruction with a direct call:

  40037c:       e8 cf 3f 00 00          call   404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>
  400381:       90                      nop

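Note the nop at the end: we have to replace all 6 bytes of machine code that make up the original instruction, but the replacement instruction is only 5 bytes long, so it needs padding. Fundamentally, when patching a compiled binary we can replace one instruction with another only if the new one is not longer.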

jmp QWORD PTR [rip+0x21265f] is the same as the previous one, except that it jumps to the address instead of calling it. This becomes:

  425903:       e9 b8 f7 ff ff          jmp    4250c0 <_ZN68_$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$9write_str17hc384e51187942069E>
  425908:       90                      nop
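The call and jmp rewrites above follow the same arithmetic, which can be sketched as follows. This is an illustrative helper, not the code used later in the answer; the name direct_branch_patch and the default length are assumptions. The 32-bit operand is relative to the end of the new 5-byte instruction.

```python
import struct

NOP = b"\x90"


def direct_branch_patch(opcode, insn_address, target, original_length=6):
    """Encode `call rel32` (0xE8) or `jmp rel32` (0xE9), NOP-padded to the original size."""
    if opcode not in (0xE8, 0xE9):
        raise ValueError("expected the call rel32 or jmp rel32 opcode")
    # rel32 is measured from the address right after the 5-byte direct branch.
    rel32 = target - (insn_address + 5)
    body = bytes([opcode]) + struct.pack("<i", rel32)  # struct.error if out of range
    if len(body) > original_length:
        raise ValueError("replacement is longer than the original instruction")
    return body + NOP * (original_length - len(body))


print(direct_branch_patch(0xE8, 0x40037C, 0x404350).hex())  # e8cf3f000090
print(direct_branch_patch(0xE9, 0x425903, 0x4250C0).hex())  # e9b8f7ffff90
```

A backward jump simply yields a negative rel32, which is why the jmp bytes end in ff ff.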

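cmp rbx,QWORD PTR [rip+0x21a5b4] - this takes 8 bytes from [rip+0x21a5b4] and compares them with the contents of the rbx register. This one is trickier, because cmp cannot compare a register with a 64-bit immediate value. We could use another register for that, but we don't know which registers are in use around this instruction. A cautious solution would be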

push rax
mov rax,0x0000006363c0
cmp rbx,rax
pop rax

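but that would exceed our 7-byte limit. The real solution comes from the observation that the GOT contains only addresses; our address space is (roughly) contained in the range [0x400000; 0x650000], as can be seen in the program headers: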

$ readelf -l hello
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000035b50 0x0000000000035b50  R E    0x200000
  LOAD           0x0000000000036380 0x0000000000636380 0x0000000000636380
                 0x0000000000001dd0 0x0000000000003918  RW     0x200000
...

It follows that we can (mostly) get away with comparing only 4 bytes of a GOT entry instead of 8. So the replacement is:

  41d8b5:       81 fb c0 63 63 00       cmp    ebx,0x6363c0
  41d8bb:       90                      nop
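The 32-bit rewrite above can be sketched like this, under the assumption that the GOT value fits in 32 bits (the helper name is invented here). Note that cmp ebx only looks at the low half of rbx, which is presumably why the text says "mostly": the rewrite matches the original only when the upper 32 bits of rbx are zero.

```python
import struct

NOP = b"\x90"


def cmp_ebx_imm32_patch(got_value, original_length=7):
    """Encode `cmp ebx, imm32` (81 FB id), NOP-padded to the original instruction size."""
    if not 0 <= got_value < 2**32:
        raise ValueError("GOT value does not fit in a 32-bit immediate")
    body = b"\x81\xfb" + struct.pack("<I", got_value)
    return body + NOP * (original_length - len(body))


print(cmp_ebx_imm32_patch(0x6363C0).hex())  # 81fbc063630090
```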

The last one takes up two lines of objdump output, because its 8 bytes don't fit on one line:

  40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
  40b260:       00

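It just compares 8 bytes of the GOT with a constant (0x0 in this case). In fact, we can do the comparison statically; if the operands compare equal, we replace the comparison with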

  40b259:       48 39 c0                cmp    rax,rax
  40b25c:       90                      nop
  40b25d:       90                      nop
  40b25e:       90                      nop
  40b25f:       90                      nop
  40b260:       90                      nop

Obviously, a register is always equal to itself. A lot of padding is needed here!

If the left operand is greater than the right one, we replace the comparison with

  40b259:       48 83 fc 00             cmp    rsp,0x0 
  40b25d:       90                      nop
  40b25e:       90                      nop
  40b25f:       90                      nop
  40b260:       90                      nop

Indeed, rsp is always greater than zero.

If the left operand is less than the right one, things get a bit more complicated, but since we have plenty of bytes (8!), we can manage:

  40b259:  50                      push   rax
  40b25a:  31 c0                   xor    eax,eax
  40b25c:  83 f8 01                cmp    eax,0x1
  40b25f:  58                      pop    rax
  40b260:  90                      nop

Note that the second and third instructions use eax instead of rax, because cmp and xor involving eax take one byte less than the same instructions with rax.
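The three cases can be summarised in one small function. This is a sketch with made-up names that compares the two values as plain non-negative integers, as the text does; it also makes it easy to confirm that every replacement is exactly 8 bytes long.

```python
NOP = b"\x90"


def static_cmp_patch(got_value, immediate, original_length=8):
    """Pick a register-only cmp that sets the same flags as `cmp [GOT slot], imm8`."""
    if got_value == immediate:
        body = b"\x48\x39\xc0"  # cmp rax,rax: always equal
    elif got_value > immediate:
        body = b"\x48\x83\xfc\x00"  # cmp rsp,0x0: always greater
    else:
        # push rax; xor eax,eax; cmp eax,0x1; pop rax: always less.
        # push and pop leave the flags untouched.
        body = b"\x50\x31\xc0\x83\xf8\x01\x58"
    if len(body) > original_length:
        raise ValueError("replacement is longer than the original instruction")
    return body + NOP * (original_length - len(body))


for got_value, immediate in [(0, 0), (0x404350, 0), (0, 1)]:
    print(static_cmp_patch(got_value, immediate).hex())
```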

Testing

I've written a Python script that performs all of these replacements automatically (it is a bit hacky and relies on parsing objdump output, though):

#!/usr/bin/env python3

import re
import sys
import argparse
import subprocess

def read_u64(binary):
    return sum(binary[i] * 256 ** i for i in range(8))

def distance_u32(start, end):
    assert abs(end - start) < 2 ** 31
    diff = end - start
    if diff < 0:
        return 2 ** 32 + diff
    else:
        return diff

def to_u32(x):
    assert 0 <= x < 2 ** 32
    return bytes((x // (256 ** i)) % 256 for i in range(4))

class GotInstruction:
    def __init__(self, lines, symbol_address, symbol_offset):
        self.address = int(lines[0].split(":")[0].strip(), 16)
        self.offset = symbol_offset + (self.address - symbol_address)
        self.got_offset = int(lines[0].split("(File Offset: ")[1].strip().strip(")"), 16)
        self.got_offset = self.got_offset % 0x200000  # No idea why the offset is actually wrong
        self.bytes = []
        for line in lines:
            self.bytes += [int(x, 16) for x in line.split("\t")[1].split()]

class TextDump:
    symbol_regex = re.compile(r"^([0-9a-f]{16}) <(.*)> \(File Offset: 0x([0-9a-f]*)\):")

    def __init__(self, binary_path):
        self.got_instructions = []
        objdump_output = subprocess.check_output(["objdump", "-Fdj", ".text", "-M", "intel",
                                                  binary_path])
        lines = objdump_output.decode("utf-8").split("\n")
        current_symbol_address = 0
        current_symbol_offset = 0
        for line_group in self.group_lines(lines):
            match = self.symbol_regex.match(line_group[0])
            if match is not None:
                current_symbol_address = int(match.group(1), 16)
                current_symbol_offset = int(match.group(3), 16)
            elif "_GLOBAL_OFFSET_TABLE_" in line_group[0]:
                instruction = GotInstruction(line_group, current_symbol_address,
                                             current_symbol_offset)
                self.got_instructions.append(instruction)

    @staticmethod
    def group_lines(lines):
        if not lines:
            return
        line_group = [lines[0]]
        for line in lines[1:]:
            if line.count("\t") == 1:  # this line continues the previous one
                line_group.append(line)
            else:
                yield line_group
                line_group = [line]
        yield line_group

    def __iter__(self):
        return iter(self.got_instructions)

def read_binary_file(path):
    try:
        with open(path, "rb") as f:
            return f.read()
    except (IOError, OSError) as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)

def write_binary_file(path, content):
    try:
        with open(path, "wb") as f:
            f.write(content)
    except (IOError, OSError) as exc:
        print(f"Failed to write {path}: {exc.strerror}")
        sys.exit(1)

def patch_got_reference(instruction, binary_content):
    got_data = read_u64(binary_content[instruction.got_offset:])
    code = instruction.bytes
    if code[0] == 0xff:
        assert len(code) == 6
        relative_address = distance_u32(instruction.address, got_data)
        if code[1] == 0x15:  # call QWORD PTR [rip+...]
            patch = b"\xe8" + to_u32(relative_address - 5) + b"\x90"
        elif code[1] == 0x25:  # jmp QWORD PTR [rip+...]
            patch = b"\xe9" + to_u32(relative_address - 5) + b"\x90"
        else:
            raise ValueError(f"unknown machine code: {code}")
    elif code[:3] == [0x48, 0x83, 0x3d]:  # cmp QWORD PTR [rip+...],<BYTE>
        assert len(code) == 8
        if got_data == code[7]:
            patch = b"\x48\x39\xc0" + b"\x90" * 5  # cmp rax,rax
        elif got_data > code[7]:
            patch = b"\x48\x83\xfc\x00" + b"\x90" * 3  # cmp rsp,0x0
        else:
            patch = b"\x50\x31\xc0\x83\xf8\x01\x58\x90"  # push rax
                                                         # xor eax,eax
                                                         # cmp eax,0x1
                                                         # pop rax, then one nop of padding
    elif code[:3] == [0x48, 0x3b, 0x1d]:  # cmp rbx,QWORD PTR [rip+...]
        assert len(code) == 7
        patch = b"\x81\xfb" + to_u32(got_data) + b"\x90"  # cmp ebx,<DWORD>
    else:
        raise ValueError(f"unknown machine code: {code}")
    return dict(offset=instruction.offset, data=patch)

def make_got_patches(binary_path, binary_content):
    patches = []
    text_dump = TextDump(binary_path)
    for instruction in text_dump.got_instructions:
        patches.append(patch_got_reference(instruction, binary_content))
    return patches

def apply_patches(binary_content, patches):
    for patch in patches:
        offset = patch["offset"]
        data = patch["data"]
        binary_content = binary_content[:offset] + data + binary_content[offset + len(data):]
    return binary_content

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("binary_path", help="Path to ELF binary")
    parser.add_argument("-o", "--output", help="Output file path", required=True)
    args = parser.parse_args()

    binary_content = read_binary_file(args.binary_path)
    patches = make_got_patches(args.binary_path, binary_content)
    patched_content = apply_patches(binary_content, patches)
    write_binary_file(args.output, patched_content)

if __name__ == "__main__":
    main()
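The script reduces the offset reported by objdump modulo 0x200000 and notes that it is unclear why this is needed. One alternative, sketched here and not used by the script, is to take the GOT slot's virtual address from the objdump comment and translate it to a file offset with the LOAD segments printed by readelf -l earlier. The table and function names are assumptions of this example.

```python
# (file offset, virtual address, size in file) of each LOAD segment,
# copied by hand from the readelf -l output shown earlier.
LOAD_SEGMENTS = [
    (0x000000, 0x400000, 0x35B50),
    (0x036380, 0x636380, 0x01DD0),
]


def vaddr_to_file_offset(vaddr, segments=LOAD_SEGMENTS):
    """Map a virtual address to its offset in the ELF file via the LOAD segments."""
    for file_offset, segment_vaddr, file_size in segments:
        if segment_vaddr <= vaddr < segment_vaddr + file_size:
            return file_offset + (vaddr - segment_vaddr)
    raise ValueError(f"address {vaddr:#x} is not backed by file contents")


print(hex(vaddr_to_file_offset(0x637DA8)))  # 0x37da8
```

For the start of .got (address 0x637b58) this gives 0x37b58, which agrees with the Offset column in the readelf -S output from the question.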

Now we can get rid of the GOT for real:

$ cargo build --release --target x86_64-unknown-linux-musl
$ ./resolve_got.py target/x86_64-unknown-linux-musl/release/hello -o hello_no_got
$ objcopy -R.got hello_no_got
$ readelf -e hello_no_got | grep .got
$ ./hello_no_got
Hello, world!

I've also tested it on my ~3k LOC application, and it seems to work fine.

P.S. I'm not an assembly expert, so some of the above may be inaccurate.
