Can I rule out that SIGBUS is raised by a "minor page fault"? (Kernel log has no allocation failure)

Motivation

I am trying to improve my understanding of the SIGBUS error in Xwayland. Since around 20 February 2018, a number of Fedora Linux users have been seeing it, with Xwayland 1.19.6-5.fc27.x86_64 and Linux kernel 4.15.3-300.fc27.x86_64.

Unfortunately, debugging the coredump does not give me the kernel "segfault" log message (or its equivalent for SIGBUS), because Xwayland has some pointless code which traps the fatal signal. But I can see the siginfo, which seems to be about as good.

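Definitions

I understand a "major page fault" to occur when a virtual memory page is not available in RAM and must be read from disk. For this question, I think I am specifically interested in pages backed by an ext4 filesystem (as opposed to, e.g., direct access to a block device).

A "minor page fault" is then one that does not require disk access. I think the distinction is fairly well-defined, since Linux exposes separate counters for major and minor page faults.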
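The per-process counters mentioned above can be read with getrusage(2). Below is a minimal sketch in Python, assuming Linux; the helper names are made up for illustration. It touches freshly mapped anonymous pages, which would normally be served as minor faults, but the exact counts depend on the kernel and its configuration, so no particular number should be expected.

```python
import mmap
import resource


def fault_counts():
    """Return (minor, major) page fault counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_anonymous_pages(n_pages):
    """Map n_pages of private anonymous memory, write one byte to each
    page, and return the (minor, major) fault count deltas."""
    min0, maj0 = fault_counts()
    buf = mmap.mmap(-1, n_pages * mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for i in range(n_pages):
        buf[i * mmap.PAGESIZE] = 1  # first touch of each page
    min1, maj1 = fault_counts()
    buf.close()
    return min1 - min0, maj1 - maj0


if __name__ == "__main__":
    minor, major = touch_anonymous_pages(256)
    print(f"minor faults: {minor}, major faults: {major}")
```

The same two numbers are what tools such as ps and /usr/bin/time report as min_flt and maj_flt.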

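My question

If the kernel sends a program SIGBUS, I want to know whether I should generally assume it was caused by a major page fault.

According to the coredump and the disassembly, the program was reading memory, not writing, when it received SIGBUS. The fault address in siginfo->si_addr is inside the mapping of the system executable, which is not writable by the user, and the address appears to be within the current length of the file. In fact, when debugging the coredump, I have read very plausible values from that memory address. It seems the coredump generation process had no difficulty reading this address :-(.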
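For context on why the file length matters: one well-known way to get SIGBUS on a plain read from a file mapping, with no I/O error involved, is to touch a page that lies beyond the end of the file. Below is a minimal sketch of that case, assuming Linux and Python 3. The helper name and the temporary file are made up for illustration and have nothing to do with Xwayland itself.

```python
import mmap
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child script: map two pages of a file, shrink the file to one page,
# then read from the second page, which no longer has file backing.
CHILD = textwrap.dedent("""
    import mmap, os, sys
    page = mmap.PAGESIZE
    fd = os.open(sys.argv[1], os.O_RDWR)
    m = mmap.mmap(fd, 2 * page, mmap.MAP_SHARED, mmap.PROT_READ)
    os.ftruncate(fd, page)
    m[page]  # read fault beyond the new end of file
""")


def read_past_eof_returncode():
    """Run the child and return its exit status (negative = killed by signal)."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * (2 * mmap.PAGESIZE))
        f.flush()
        proc = subprocess.run([sys.executable, "-c", CHILD, f.name])
        return proc.returncode


if __name__ == "__main__":
    rc = read_past_eof_returncode()
    print("child killed by SIGBUS:", rc == -signal.SIGBUS)
```

The fault is run in a child process so that the fatal signal does not take down the calling interpreter.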

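I am also confident in ruling out the "invalid address alignment" case (BUS_ADRALN), because siginfo->si_code is 2, i.e. BUS_ADRERR, "non-existent physical address". Also because I am on x86, which allows unaligned access in most cases, and the trap is not in any SSE instruction.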
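For reference, the si_code values that Linux documents for SIGBUS in sigaction(2) can be written as a small lookup table. This is only a sketch in Python for decoding the number seen in a coredump; the function name is hypothetical.

```python
# si_code values for SIGBUS, as listed in sigaction(2) on Linux.
BUS_SI_CODES = {
    1: ("BUS_ADRALN", "invalid address alignment"),
    2: ("BUS_ADRERR", "non-existent physical address"),
    3: ("BUS_OBJERR", "object-specific hardware error"),
    4: ("BUS_MCEERR_AR", "hardware memory error, action required"),
    5: ("BUS_MCEERR_AO", "hardware memory error, action optional"),
}


def describe_bus_code(si_code):
    """Return a readable description of a SIGBUS si_code."""
    name, meaning = BUS_SI_CODES.get(si_code, ("unknown", "not a known SIGBUS code"))
    return f"{name}: {meaning}"


if __name__ == "__main__":
    print(describe_bus_code(2))
```

With the value from this coredump, describe_bus_code(2) gives the BUS_ADRERR entry.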

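I considered what the kernel is typically responsible for when it handles a page fault that it determines to be "minor". I suppose a minor fault could fail to allocate memory, and hence raise SIGBUS. However, I believe I would have noticed such an allocation failure:

I have plenty of free swap to evict user pages to, and I did not notice the obvious slowdown that usually occurs when my system starts swapping. The crash happens a few seconds after waking the laptop from suspend-to-RAM; even at ~100MB/s, that is not enough time to fill 8GB of swap space. Nor did I see the dreaded out-of-memory (OOM) killer show up in the kernel log, which I would have expected if the kernel had failed to allocate a page frame or a page table.

Are there other ways a minor page fault can fail and cause SIGBUS? I.e. are there some causes I have not noticed when looking for errors in the kernel log? Which of them could strike quickly?

Again, multiple core dumps show this as a page fault triggered when reading from a mapped file on the filesystem.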

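Ulterior motive

I would really like to have missed a case of a minor page fault. Because the scary flip side is that I do not understand how this SIGBUS could be caused by a hard page fault. A number of us users started getting very similar errors a few months ago. There are no IO errors in my kernel log. During normal operation, there are no IO errors reading the file in question. I get no errors when running rpm --verify --all, or when running an extended SMART test on the HDD. Unfortunately, I seem to have very few suspects. The closest suspect is a kernel upgrade, which I would obviously prefer to rule out; the dates do not completely prove it, but they do not completely rule it out either. The next closest date is this year's microcode update; that seems even harder to pin down.

Known causes of minor page faults

  1. Logically, it sounds like a minor page fault occurs when implementing copy-on-write for a MAP_PRIVATE mapping.
  2. It should also include read faults on /dev/zero or MAP_ANONYMOUS, assuming the kernel does not implement them as reading a shared zero page, and does not implement them by allocating pages for the entire mapping up front.
  3. More generally, though, it can be any first access to a page. This is because the page tables for memory mappings generally seem to be populated on demand. (This is done by page faults, and if the file page is already in the cache, that will only be a minor page fault.)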

    MAP_NONBLOCK (since Linux 2.5.46)

    This flag is meaningful only in conjunction with MAP_POPULATE. Don't perform read-ahead: create page tables entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combination of MAP_POPULATE and MAP_NONBLOCK may be reimplemented.


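Edit: further excerpts elaborating on the above

A commenter asked for more specific details, to clarify the fault address and the instruction. There are a lot of excerpts at the start of the bug link: https://bugzilla.redhat.com/show_bug.cgi?id=1557682

The faults differ from those described in the bug link. Below are fresh excerpts from the most recent instance.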

$ gdb 2018-03-21.core
...
Core was generated by `/usr/bin/Xwayland :0 -rootless -terminate -core -listen 4 -listen 5 -displayfd'.
Program terminated with signal SIGBUS, Bus error.
#0  _dl_fixup (l=0x7fc0be2e0130, reloc_arg=203) at ../elf/dl-runtime.c:73
73    const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
[Current thread is 1 (Thread 0x7fc0be29fa80 (LWP 1918))]
(gdb) p $_siginfo.si_signum
 = 7
(gdb) p $_siginfo.si_code
 = 2
(gdb) p $_siginfo._sifields._sigfault.si_addr
 = (void *) 0x41bd80
(gdb) disassemble
Dump of assembler code for function _dl_fixup:
   0x00007fc0be0c8bd0 <+0>: push   %rbx
   0x00007fc0be0c8bd1 <+1>: mov    %rdi,%r10
   0x00007fc0be0c8bd4 <+4>: mov    %esi,%esi
   0x00007fc0be0c8bd6 <+6>: lea    (%rsi,%rsi,2),%rdx
   0x00007fc0be0c8bda <+10>:    sub    $0x10,%rsp
   0x00007fc0be0c8bde <+14>:    mov    0x68(%rdi),%rax
   0x00007fc0be0c8be2 <+18>:    mov    0x8(%rax),%rdi
   0x00007fc0be0c8be6 <+22>:    mov    0xf8(%r10),%rax
   0x00007fc0be0c8bed <+29>:    mov    0x8(%rax),%rax
   0x00007fc0be0c8bf1 <+33>:    lea    (%rax,%rdx,8),%r8
   0x00007fc0be0c8bf5 <+37>:    mov    0x70(%r10),%rax
=> 0x00007fc0be0c8bf9 <+41>:    mov    0x8(%r8),%rcx
(gdb) p/x $r8
 = 0x41bd78
(gdb) p/x $r8 + 8
 = 0x41bd80

Note that this instruction is fetching the value reloc->r_info, per the highlighted source line.

(gdb) p reloc
 = (const Elf64_Rela * const) 0x41bd78
(gdb) p &reloc->r_info
 = (Elf64_Xword *) 0x41bd80
(gdb) p *reloc
 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}
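The numbers in this gdb session can be cross-checked by hand. The two lea instructions compute r8 = base + reloc_arg * 24, where 24 is sizeof(Elf64_Rela), and r_info packs the symbol index in its upper 32 bits and the relocation type in its lower 32 bits. Below is a small sketch of that arithmetic in Python; the function names are made up, and the constant 7 for R_X86_64_JUMP_SLOT is taken from the x86-64 psABI.

```python
RELA_SIZE = 24           # sizeof(Elf64_Rela): r_offset, r_info, r_addend
R_X86_64_JUMP_SLOT = 7   # relocation type from the x86-64 psABI


def split_r_info(r_info):
    """Split a 64-bit r_info into (symbol index, relocation type)."""
    return r_info >> 32, r_info & 0xFFFFFFFF


def jmprel_base(reloc_addr, reloc_arg):
    """Recover the start of the PLT relocation table from one entry."""
    return reloc_addr - reloc_arg * RELA_SIZE


if __name__ == "__main__":
    # Values copied from the gdb session above.
    sym, rtype = split_r_info(936302870535)
    print("symbol index:", sym, "type:", rtype)
    print("r_offset:", hex(8443504))
    print("relocation table base:", hex(jmprel_base(0x41bd78, 203)))
```

The decoded type is 7, and r_offset is 0x80d670, so the entry looks like an ordinary, uncorrupted PLT relocation.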

The fault address is inside the text mapping below (from the maps file captured by abrtd):

00400000-0060b000 r-xp 00000000 fd:00 1708508                            /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508                            /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508                            /usr/bin/Xwayland

$ size -x /usr/bin/Xwayland
   text    data     bss     dec     hex filename
0x209ffb     0xbe9d 0x1f3e0 2314872  235278 /usr/bin/Xwayland
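The lookup of the fault address in the maps excerpt can also be done mechanically. Below is a minimal sketch in Python that parses lines in /proc/<pid>/maps format and converts an address to a file offset. The function names are hypothetical, and comparing the offset against the text size from size(1) is only a rough sanity check, on the assumption that the file is at least that long.

```python
MAPS = """\
00400000-0060b000 r-xp 00000000 fd:00 1708508                            /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508                            /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508                            /usr/bin/Xwayland
"""


def parse_maps_line(line):
    """Parse one /proc/<pid>/maps line into a dict."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


def locate(addr, maps_text):
    """Return (mapping, file offset) for addr, or None if it is unmapped."""
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        m = parse_maps_line(line)
        if m["start"] <= addr < m["end"]:
            return m, m["offset"] + (addr - m["start"])
    return None


if __name__ == "__main__":
    mapping, file_offset = locate(0x41BD80, MAPS)
    print(mapping["perms"], mapping["path"], hex(file_offset))
    print("below text size 0x209ffb:", file_offset < 0x209FFB)
```

For the address in si_addr this gives the r-xp mapping of /usr/bin/Xwayland at file offset 0x1bd80.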

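I definitely have some bug in the kernel, unless it is a bug in the kernel selftests.

Edit: hmm, it actually seems that someone else also noticed the GS selftest failure recently, but it is already present in older kernels, and it also shows up on AMD CPUs. There does not seem to be a conclusion yet on how to fix it. https://lkml.org/lkml/2018/1/26/436

So this is not the bug in itself, but I cannot rule out that this GS bug causes more visible corruption when PTI is enabled, or something.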

$ uname -r
4.15.10-300.fc27.x86_64

$ git describe --all
heads/4.15.10
$ cat ./Documentation/x86/pti.txt
...
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.

$ cd tools/testing/selftests/x86
$ make
...

In 4 terminals, to match my 4 hardware threads:

sh -c ' while true; do for i in *; do if test -x $i; then ./$i || exit; fi ; done; done '

Failures appear quickly:

[RUN]   ARCH_SET_GS(0x200000000), then schedule to 0x200000000
    Before schedule, set selector to 0x3
    other thread: ARCH_SET_GS(0x200000000) -- sel is 0x0
[FAIL]  GS/BASE changed from 0x3/0x0 to 0x0/0x0

And also:

[RUN]   Executing 6-argument 32-bit syscall via VDSO
[WARN]  Flags before=0000000000200ed7 id 0 00 o d i s z 0 a 0 p 1 c
[WARN]  Flags  after=0000000000200682 id 0 00 d i s 0 0 1 
[WARN]  Flags change=0000000000000855 0 00 o z 0 a 0 p 0 c
[OK]    Arguments are preserved across syscall
[NOTE]  R11 has changed:0000000000200682 - assuming clobbered by SYSRET insn
[OK]    R8..R15 did not leak kernel data
[RUN]   Executing 6-argument 32-bit syscall via INT 80
[OK]    Arguments are preserved across syscall
[OK]    R8..R15 did not leak kernel data
[RUN]   Running tests under ptrace
[RUN]   Executing 6-argument 32-bit syscall via VDSO
[WARN]  Flags before=0000000000200ed7 id 0 00 o d i s z 0 a 0 p 1 c
[WARN]  Flags  after=0000000000200686 id 0 00 d i s 0 0 p 1 
[WARN]  Flags change=0000000000000851 0 00 o z 0 a 0 0 c
[OK]    Arguments are preserved across syscall
[NOTE]  R11 has changed:0000000000200686 - assuming clobbered by SYSRET insn
[OK]    R8..R15 did not leak kernel data
[RUN]   Executing 6-argument 32-bit syscall via INT 80
[OK]    Arguments are preserved across syscall
[OK]    R8..R15 did not leak kernel data
Warning: failed to find getcpu in vDSO
[RUN]   Testing getcpu...
[OK]    CPU 0: syscall: cpu 0, node 0
[OK]    CPU 1: syscall: cpu 1, node 0
[OK]    CPU 2: syscall: cpu 2, node 0
[OK]    CPU 3: syscall: cpu 3, node 0
[RUN]   Testing getcpu...
[OK]    CPU 0: syscall: cpu 0, node 0 vdso: cpu 0, node 0 vsyscall: cpu 0, node 0
[OK]    CPU 1: syscall: cpu 1, node 0 vdso: cpu 1, node 0 vsyscall: cpu 1, node 0
[OK]    CPU 2: syscall: cpu 2, node 0 vdso: cpu 2, node 0 vsyscall: cpu 2, node 0
[OK]    CPU 3: syscall: cpu 3, node 0 vdso: cpu 3, node 0 vsyscall: cpu 3, node 0
[NOTE]  failed to find getcpu in vDSO
[RUN]   test gettimeofday()
    vDSO time offsets: 0.000006 0.000000
[OK]    vDSO gettimeofday()'s timeval was okay
[RUN]   test time()
[FAIL]  vDSO returned the wrong time (1522063297 1522063296 1522063297)

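Thanks to everyone for the support. This did indeed turn out to be a transient IO error. The SIGBUS read-fault path does not necessarily seem to log anything in the kernel log, unlike the IO errors I have usually seen in the past.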

https://marc.info/?l=linux-ide&m=152232081917215&w=2

v4.15 intermittent errors on suspend/resume

To anyone waiting for the other shoe to drop on the SATA LPM work...

I've found something that's at least in the same area. It triggered a fsck on my system 2 days ago. Evidence suggests it's occurred on many other machines. I felt that was reason enough to give you a heads up :).

I checked and I don't seem to have LPM enabled during runtime, even when running on battery. My errors are all on suspend/resume, so maybe that behaviour was changed at the same time?

It doesn't always show in kernel logs. What I first noticed was a mysterious SIGBUS that kills Xwayland (and hence the entire Gnome session) on resume from suspend. It surprised me to learn that this SIGBUS can happen, without leaving anything like the read errors I'm used to seeing in the kernel log!

My coredumps show the SIGBUS fault address is an instruction read inside the program code of Xwayland. The backtraces vary along the same call chain - the common factor is that they're always at the first instruction of the function. I assume it varies according to which page is not currently in-core, and hence triggers the failing read request.

There are hundreds of backtraces along this same call chain from other users, reported automatically to Fedora, that look the same. At least so far we don't have any more plausible explanation for them. I admit it's funny that Xwayland is so prominent, and I haven't been swamped with SIGBUS in other processes, but I stand by this analysis.

These crashes started within 24 hours of Fedora upgrading to kernel v4.15.

Fedora bug for the Xwayland SIGBUS: https://bugzilla.redhat.com/show_bug.cgi?id=1553979

My duplicate bug I've been spamming with puzzled comments: https://bugzilla.redhat.com/show_bug.cgi?id=1557682

The earliest and biggest of the many crash report buckets:

EXT4 filesystem error:

Mar 27 11:28:30 alan-laptop kernel: PM: suspend exit
...
Mar 27 11:28:30 alan-laptop kernel: EXT4-fs error (device dm-2):  ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Mar 27 11:28:30 alan-laptop kernel: Buffer I/O error on dev dm-2, logical block 0, lost sync page write
(this marked the FS as needing fsck on next boot)

More frequently, it logs these swap errors:

Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)

My laptop LPM status, even after removing AC power:

$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
max_performance

==> /sys/class/scsi_host/host1/link_power_management_policy <==
max_performance

My laptop is a Dell Latitude E5450. CPU is i5-5300U (a Broadwell).