Why does a fully static Rust ELF binary have a Global Offset Table (GOT) section?
This code produces a .got section when compiled for the x86_64-unknown-linux-musl target:
fn main() {
    println!("Hello, world!");
}
$ cargo build --release --target x86_64-unknown-linux-musl
$ readelf -S hello
There are 30 section headers, starting at offset 0x26dc08:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
...
[12] .got PROGBITS 0000000000637b58 00037b58
00000000000004a8 0000000000000008 WA 0 0 8
...
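According to an answer for analogous C code, the .got section is an artifact that can be safely removed. However, removing it makes my binary segfault: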
$ objcopy -R.got hello hello_no_got
$ ./hello_no_got
[1] 3131 segmentation fault (core dumped) ./hello_no_got
Looking at the disassembly, the GOT basically holds static function addresses:
$ objdump -d hello -M intel
...
0000000000400340 <_ZN5hello4main17h5d434a6e08b2e3b8E>:
...
40037c: ff 15 26 7a 23 00 call QWORD PTR [rip+0x237a26] # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
...
$ objdump -s -j .got hello | grep 637da8
637da8 50434000 00000000 b0854000 00000000 PC@.......@.....
$ objdump -d hello -M intel | grep 404350
0000000000404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>:
404350: 41 57 push r15
The number 404350 comes from 50434000 00000000, which is the little-endian encoding of 0x0000000000404350 (this was not obvious; I had to run the binary under GDB to figure it out!)
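The byte-order step above can be checked without GDB. A minimal sketch, assuming the eight bytes are copied by hand from the `objdump -s -j .got` line shown above:

```python
# Bytes of the GOT slot at 0x637da8, as printed by `objdump -s -j .got hello`.
raw = bytes.fromhex("50434000" "00000000")

# x86-64 stores integers least-significant byte first (little-endian).
address = int.from_bytes(raw, byteorder="little")

print(hex(address))  # 0x404350
```

The result matches the address of `_ZN3std2io5stdio6_print17h522bda9f206d7fddE` in the disassembly.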
This is perplexing, since Wikipedia says
[GOT] is used by executed programs to find during runtime addresses of global variables, unknown in compile time. The global offset table is updated in process bootstrap by the dynamic linker.
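- Why is the GOT present? From the disassembly, it looks like the compiler knows all the needed addresses. As far as I know, no bootstrap is done by the dynamic linker: there are neither INTERP nor DYNAMIC program headers in my binary;
- Why does the GOT store function pointers? Wikipedia says the GOT is only for global variables, and function pointers should be contained in the PLT.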
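TL;DR summary: the GOT is indeed a rudimentary build artifact, which I was able to get rid of via simple machine code manipulations.
Details
If we look at
$ objdump -dj .text hello
and search for GLOBAL, we see only four distinct types of references to the GOT (the constants differ):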
40037c: ff 15 26 7a 23 00 call QWORD PTR [rip+0x237a26] # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
425903: ff 25 5f 26 21 00 jmp QWORD PTR [rip+0x21265f] # 637f68 <_GLOBAL_OFFSET_TABLE_+0x410>
41d8b5: 48 3b 1d b4 a5 21 00 cmp rbx,QWORD PTR [rip+0x21a5b4] # 637e70 <_GLOBAL_OFFSET_TABLE_+0x318>
40b259: 48 83 3d 7f cb 22 00 cmp QWORD PTR [rip+0x22cb7f],0x0 # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
40b260: 00
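All of these are read instructions, which means the GOT is not modified at runtime. This in turn means we can resolve the addresses the GOT refers to statically! Let's consider the reference types one by one: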
call QWORD PTR [rip+0x237a26]
This simply says "go to address [rip+0x237a26], take 8 bytes from there, interpret them as a function address and call the function". We can simply replace this instruction with a direct call:
40037c: e8 cf 3f 00 00 call 404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>
400381: 90 nop
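Note the nop at the end: we need to replace all 6 bytes of machine code that make up the original instruction, but the replacement is only 5 bytes long, so we need to pad it. Fundamentally, when patching a compiled binary, we can replace one instruction with another only if the replacement is no longer than the original.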
jmp QWORD PTR [rip+0x21265f]
This is the same as the previous one, but it jumps to the address instead of calling it. This becomes:
425903: e9 b8 f7 ff ff jmp 4250c0 <_ZN68_$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$9write_str17hc384e51187942069E>
425908: 90 nop
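The arithmetic behind the call and jmp replacements can be reproduced in a few lines. This is a sketch, not part of the original answer: the helper names `got_slot_address` and `direct_branch` are made up for illustration, and the addresses are taken from the listings above.

```python
import struct


def got_slot_address(insn_address, insn_length, displacement):
    """Address read by a RIP-relative operand; RIP points past the instruction."""
    return insn_address + insn_length + displacement


def direct_branch(opcode, insn_address, target, original_length):
    """Encode `opcode rel32` (0xE8 = call, 0xE9 = jmp), NOP-padded to original_length."""
    # rel32 is relative to the end of the 5-byte direct instruction.
    rel32 = target - (insn_address + 5)
    encoded = bytes([opcode]) + struct.pack("<i", rel32)
    return encoded + b"\x90" * (original_length - len(encoded))


slot = got_slot_address(0x40037c, 6, 0x237a26)
call_patch = direct_branch(0xE8, 0x40037c, 0x404350, 6)
jmp_patch = direct_branch(0xE9, 0x425903, 0x4250c0, 6)

print(hex(slot))            # 0x637da8
print(call_patch.hex(" "))  # e8 cf 3f 00 00 90
print(jmp_patch.hex(" "))   # e9 b8 f7 ff ff 90
```

`struct.pack("<i", ...)` raises an error if the target is more than 2 GiB away, which is the same limit the direct call and jmp encodings have.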
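cmp rbx,QWORD PTR [rip+0x21a5b4]
This takes 8 bytes from [rip+0x21a5b4] and compares them with the contents of the rbx register. This one is trickier, since cmp cannot compare a register with a 64-bit immediate value. We could use another register for that, but we don't know which registers are in use around this instruction. A cautious solution would be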
push rax
mov rax,0x0000006363c0
cmp rbx,rax
pop rax
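but that would exceed our limit of 7 bytes. The real solution stems from the observation that the GOT contains only addresses; our address space is (roughly) contained in the range [0x400000; 0x650000], as can be seen in the program headers: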
$ readelf -l hello
...
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x0000000000035b50 0x0000000000035b50 R E 0x200000
LOAD 0x0000000000036380 0x0000000000636380 0x0000000000636380
0x0000000000001dd0 0x0000000000003918 RW 0x200000
...
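Since every such address fits in 32 bits, we can (mostly) get away with comparing only 4 bytes of the GOT entry instead of 8; the result can differ from the original only when the upper half of rbx is non-zero. So the replacement is: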
41d8b5: 81 fb c0 63 63 00 cmp ebx,0x6363c0
41d8bb: 90 nop
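The 4-byte comparison can be sketched as below. The function `cmp_ebx_patch` mirrors the replacement shown above. The second function, `cmp_rbx_patch`, is a variant the original answer does not use: the `REX.W 81 /7` form of cmp takes a sign-extended 32-bit immediate, is exactly 7 bytes long, and still compares the full rbx register, provided the GOT value is below 2^31. Both function names are made up for illustration.

```python
import struct


def cmp_ebx_patch(got_value, original_length=7):
    """Encode `cmp ebx, imm32` (81 FB id), NOP-padded to the original length."""
    if not 0 <= got_value < 2 ** 32:
        raise ValueError("GOT value does not fit in 32 bits")
    encoded = b"\x81\xfb" + struct.pack("<I", got_value)
    return encoded + b"\x90" * (original_length - len(encoded))


def cmp_rbx_patch(got_value):
    """Encode `cmp rbx, imm32` (48 81 FB id); the immediate is sign-extended to 64 bits."""
    if not 0 <= got_value < 2 ** 31:
        raise ValueError("GOT value would change under sign extension")
    return b"\x48\x81\xfb" + struct.pack("<I", got_value)


print(cmp_ebx_patch(0x6363c0).hex(" "))  # 81 fb c0 63 63 00 90
print(cmp_rbx_patch(0x6363c0).hex(" "))  # 48 81 fb c0 63 63 00
```

The second form needs no padding and avoids the "mostly" caveat, at the cost of rejecting values of 2^31 or more.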
The last one spans two lines of objdump output, since its 8 bytes do not fit on one line:
40b259: 48 83 3d 7f cb 22 00 cmp QWORD PTR [rip+0x22cb7f],0x0 # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
40b260: 00
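It simply compares 8 bytes of the GOT with a constant (0x0 in this case). In fact, we can do the comparison statically; if the operands compare equal, we replace the comparison with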
40b259: 48 39 c0 cmp rax,rax
40b25c: 90 nop
40b25d: 90 nop
40b25e: 90 nop
40b25f: 90 nop
40b260: 90 nop
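Obviously, a register is always equal to itself. A lot of padding is needed here!
If the left operand is greater than the right one, we replace the comparison with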
40b259: 48 83 fc 00 cmp rsp,0x0
40b25d: 90 nop
40b25e: 90 nop
40b25f: 90 nop
40b260: 90 nop
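In practice, rsp is always greater than zero.
If the left operand is less than the right one, things get a bit more complicated, but since we have plenty of bytes (8!), we can manage: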
40b259: 50 push rax
40b25a: 31 c0 xor eax,eax
40b25c: 83 f8 01 cmp eax,0x1
40b25f: 58 pop rax
40b260: 90 nop
Note that the second and third instructions use eax instead of rax, because the eax forms of cmp and xor take one byte less than the rax forms.
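The three cases above can be folded into one helper that always returns exactly 8 bytes. A minimal sketch, assuming the GOT value and the immediate are both treated as non-negative integers; `constant_cmp_patch` is a name made up for illustration.

```python
def constant_cmp_patch(got_value, immediate):
    """8-byte replacement for `cmp QWORD PTR [rip+...], imm8` when the GOT value is known."""
    if got_value == immediate:
        patch = b"\x48\x39\xc0"                  # cmp rax,rax: flags say "equal"
    elif got_value > immediate:
        patch = b"\x48\x83\xfc\x00"              # cmp rsp,0x0: flags say "greater"
    else:
        # push rax; xor eax,eax; cmp eax,0x1; pop rax: flags say "less"
        patch = b"\x50\x31\xc0\x83\xf8\x01\x58"
    # Pad with NOPs so the patch covers all 8 bytes of the original instruction.
    return patch.ljust(8, b"\x90")


print(constant_cmp_patch(0x0, 0x0).hex(" "))       # 48 39 c0 90 90 90 90 90
print(constant_cmp_patch(0x404350, 0x0).hex(" "))  # 48 83 fc 00 90 90 90 90
```

The push/pop pair in the last case preserves rax, and pop does not modify the flags, so the comparison result survives until the following conditional instruction.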
Testing
I have written a Python script that automates all of these replacements (it is a bit hacky, though, and relies on parsing objdump output):
#!/usr/bin/env python3

import re
import sys
import argparse
import subprocess


def read_u64(binary):
    # Decode the first 8 bytes as a little-endian unsigned 64-bit integer.
    return sum(binary[i] * 256 ** i for i in range(8))


def distance_u32(start, end):
    # Signed 32-bit distance from start to end, in two's complement form.
    assert abs(end - start) < 2 ** 31
    diff = end - start
    if diff < 0:
        return 2 ** 32 + diff
    else:
        return diff


def to_u32(x):
    # Encode an unsigned 32-bit integer as 4 little-endian bytes.
    assert 0 <= x < 2 ** 32
    return bytes((x // (256 ** i)) % 256 for i in range(4))


class GotInstruction:
    def __init__(self, lines, symbol_address, symbol_offset):
        self.address = int(lines[0].split(":")[0].strip(), 16)
        self.offset = symbol_offset + (self.address - symbol_address)
        self.got_offset = int(lines[0].split("(File Offset: ")[1].strip().strip(")"), 16)
        self.got_offset = self.got_offset % 0x200000  # No idea why the offset is actually wrong
        self.bytes = []
        for line in lines:
            self.bytes += [int(x, 16) for x in line.split("\t")[1].split()]


class TextDump:
    symbol_regex = re.compile(r"^([0-9a-f]{16}) <(.*)> \(File Offset: 0x([0-9a-f]*)\):")

    def __init__(self, binary_path):
        self.got_instructions = []
        objdump_output = subprocess.check_output(["objdump", "-Fdj", ".text", "-M", "intel",
                                                  binary_path])
        lines = objdump_output.decode("utf-8").split("\n")
        current_symbol_address = 0
        current_symbol_offset = 0
        for line_group in self.group_lines(lines):
            match = self.symbol_regex.match(line_group[0])
            if match is not None:
                current_symbol_address = int(match.group(1), 16)
                current_symbol_offset = int(match.group(3), 16)
            elif "_GLOBAL_OFFSET_TABLE_" in line_group[0]:
                instruction = GotInstruction(line_group, current_symbol_address,
                                             current_symbol_offset)
                self.got_instructions.append(instruction)

    @staticmethod
    def group_lines(lines):
        if not lines:
            return
        line_group = [lines[0]]
        for line in lines[1:]:
            if line.count("\t") == 1:  # this line continues the previous one
                line_group.append(line)
            else:
                yield line_group
                line_group = [line]
        yield line_group

    def __iter__(self):
        return iter(self.got_instructions)


def read_binary_file(path):
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)


def write_binary_file(path, content):
    try:
        with open(path, "wb") as f:
            f.write(content)
    except OSError as exc:
        print(f"Failed to write {path}: {exc.strerror}")
        sys.exit(1)


def patch_got_reference(instruction, binary_content):
    got_data = read_u64(binary_content[instruction.got_offset:])
    code = instruction.bytes
    if code[0] == 0xff:
        assert len(code) == 6
        relative_address = distance_u32(instruction.address, got_data)
        if code[1] == 0x15:  # call QWORD PTR [rip+...]
            patch = b"\xe8" + to_u32(relative_address - 5) + b"\x90"
        elif code[1] == 0x25:  # jmp QWORD PTR [rip+...]
            patch = b"\xe9" + to_u32(relative_address - 5) + b"\x90"
        else:
            raise ValueError(f"unknown machine code: {code}")
    elif code[:3] == [0x48, 0x83, 0x3d]:  # cmp QWORD PTR [rip+...],<BYTE>
        assert len(code) == 8
        if got_data == code[7]:
            patch = b"\x48\x39\xc0" + b"\x90" * 5  # cmp rax,rax
        elif got_data > code[7]:
            patch = b"\x48\x83\xfc\x00" + b"\x90" * 3  # cmp rsp,0x0
        else:
            # push rax
            # xor eax,eax
            # cmp eax,0x1
            # pop rax
            # nop
            patch = b"\x50\x31\xc0\x83\xf8\x01\x58\x90"
    elif code[:3] == [0x48, 0x3b, 0x1d]:  # cmp rbx,QWORD PTR [rip+...]
        assert len(code) == 7
        patch = b"\x81\xfb" + to_u32(got_data) + b"\x90"  # cmp ebx,<DWORD>
    else:
        raise ValueError(f"unknown machine code: {code}")
    return dict(offset=instruction.offset, data=patch)


def make_got_patches(binary_path, binary_content):
    patches = []
    text_dump = TextDump(binary_path)
    for instruction in text_dump.got_instructions:
        patches.append(patch_got_reference(instruction, binary_content))
    return patches


def apply_patches(binary_content, patches):
    for patch in patches:
        offset = patch["offset"]
        data = patch["data"]
        binary_content = binary_content[:offset] + data + binary_content[offset + len(data):]
    return binary_content


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("binary_path", help="Path to ELF binary")
    parser.add_argument("-o", "--output", help="Output file path", required=True)
    args = parser.parse_args()

    binary_content = read_binary_file(args.binary_path)
    patches = make_got_patches(args.binary_path, binary_content)
    patched_content = apply_patches(binary_content, patches)
    write_binary_file(args.output, patched_content)


if __name__ == "__main__":
    main()
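The program headers shown earlier offer another way to find the file offset of a GOT slot, as an alternative to reducing the objdump value modulo 0x200000. The following is a sketch under the assumption that the two LOAD segments are copied by hand from the `readelf -l hello` output above; `vaddr_to_file_offset` and `LOAD_SEGMENTS` are names made up for illustration, and a real tool would read the segments from the ELF file itself.

```python
# (p_offset, p_vaddr, p_filesz) of the two LOAD segments shown by `readelf -l hello`.
LOAD_SEGMENTS = [
    (0x0, 0x400000, 0x35b50),
    (0x36380, 0x636380, 0x1dd0),
]


def vaddr_to_file_offset(vaddr, segments=LOAD_SEGMENTS):
    """Map a virtual address to a file offset using the LOAD segment that contains it."""
    for p_offset, p_vaddr, p_filesz in segments:
        if p_vaddr <= vaddr < p_vaddr + p_filesz:
            return vaddr - p_vaddr + p_offset
    raise ValueError(f"address {vaddr:#x} is not backed by file contents")


print(hex(vaddr_to_file_offset(0x637da8)))  # 0x37da8
```

For this binary the result agrees with both the .got section header (address 0x637b58 at offset 0x37b58) and the modulo trick, since 0x637da8 % 0x200000 is also 0x37da8.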
Now we can really get rid of the GOT:
$ cargo build --release --target x86_64-unknown-linux-musl
$ ./resolve_got.py target/x86_64-unknown-linux-musl/release/hello -o hello_no_got
$ objcopy -R.got hello_no_got
$ readelf -e hello_no_got | grep .got
$ ./hello_no_got
Hello, world!
I have also tested it on my ~3k LOC application, and it seems to work fine.
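Before stripping the section with objcopy, it may be worth checking that the patched binary has no GOT references left in .text. A small sketch, assuming the disassembly text comes from the same `objdump -dj .text -M intel` invocation used earlier; `remaining_got_references` is a name made up for illustration.

```python
def remaining_got_references(disassembly):
    """Return the disassembly lines that objdump still annotates with a GOT symbol."""
    return [line for line in disassembly.splitlines()
            if "_GLOBAL_OFFSET_TABLE_" in line]


# Two sample lines: the original indirect call and its patched replacement.
before = "40037c: ff 15 26 7a 23 00 call QWORD PTR [rip+0x237a26] # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>"
after = "40037c: e8 cf 3f 00 00 call 404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>"

print(len(remaining_got_references(before)))  # 1
print(len(remaining_got_references(after)))   # 0
```

In practice the input would be the decoded output of `subprocess.check_output(["objdump", "-dj", ".text", "-M", "intel", path])`, and an empty result suggests the section is no longer referenced from code.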
P.S. I am not an expert in assembly, so some of the above may be inaccurate.