rcu_preempt self-detected stall on CPU { 0}

Question

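I am running an application on my Intel Rangeley board, which runs a 3.14.29-rt22 kernel. The application runs two threads, each with priority 39 and with periods of 1 ms and 2 ms respectively. Both threads run in a continuous while loop and run only on core 0. After running for some time, about 10 minutes, when I press Ctrl+C, it gives the log below.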

**INFO: rcu_preempt self-detected stall on CPU { 0}  (t=21000 jiffies g=2362 c=2361 q=207)**
**sending NMI to all CPUs:
 NMI backtrace for cpu 1**

 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.14.29ltsi-rt22-yocto-preempt-rt+ #1

 Hardware name: ADI Engineering RCC-VE/RCC-VE, BIOS ADI_RCCVE-01.00.00.04-nodebug 05/06/2015

task: ffff8802761a0000 ti: ffff8802761a8000 task.ti: ffff8802761a8000
RIP: 0010:[<ffffffff8100b451>]  [<ffffffff8100b451>]   native_read_tsc+0x1/0x20
RSP: 0018:ffff8802761abe28  EFLAGS: 00000003
RAX: 0000000000000000 RBX: ffffffff81e1acc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffffff81e1acc0
RBP: ffff8802761abe38 R08: ffff8802761a8000 R09: 0000000000000001
R10: 0000000000000800 R11: 0000000000000000 R12: 000000000000003e
R13: 0000000000014e76 R14: ffff8802761abfd8 R15: ffff88027fc8cf00
FS:  0000000000000000(0000) GS:ffff88027fc80000(0000)   knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fabcd23f000 CR3: 0000000269589000 CR4: 00000000001007e0
Stack:
ffff8802761abe38 ffffffff8100b4a9 ffff8802761abe60 ffffffff810a6b73
0000000000000001 ffff8802761abfd8 ffffffff81edc030 ffff8802761abec0
ffffffff810b01a5 ffffffffffffff10 ffffffff8103b906 0000000000000000
Call Trace:
[<ffffffff8100b4a9>] ? read_tsc+0x9/0x20
[<ffffffff810a6b73>] ktime_get+0x43/0xc0
[<ffffffff810b01a5>] __tick_nohz_idle_enter+0x25/0x480
[<ffffffff8103b906>] ? native_safe_halt+0x6/0x10
[<ffffffff810b064a>] tick_nohz_idle_enter+0x4a/0x80
[<ffffffff8109a626>] cpu_startup_entry+0x46/0x290
[<ffffffff81031597>] start_secondary+0x1b7/0x210

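What could be the reason? Is it because I am using the CPU continuously for a long time? When I print anything to the console from the threads, this crash does not happen.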

Answer 1

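Yes, continuous CPU use by a high-priority thread for a long time (1 ms is a long period from the scheduler's point of view) can be the cause of an RCU stall.

From the documentation on the RCU stall detector: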

The following problems can result in RCU CPU stall warnings:

... A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might happen to preempt a low-priority task in the middle of an RCU read-side critical section. This is especially damaging if that low-priority task is not permitted to run on any other CPU, in which case the next RCU grace period can never complete, which will eventually cause the system to run out of memory and hang.

... A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that is running at a higher priority than the RCU softirq threads. This will prevent RCU callbacks from ever being invoked, and in a CONFIG_PREEMPT_RCU kernel will further prevent RCU grace periods from ever completing. Either way, the system will eventually run out of memory and hang.

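Executing any system call from the high-priority thread (such as write() to the console) lets the kernel do some of its housekeeping work.

Probably, sched_yield would help too.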
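As an illustration of giving the kernel time on core 0, here is a minimal sketch of a periodic loop that blocks until its next deadline instead of spinning. This is not the asker's code, which was not posted: `run_periodic`, `timespec_add_ns` and `count_tick` are hypothetical names, and the sketch assumes the 1 ms or 2 ms work finishes well inside its period. Only `clock_gettime` and `clock_nanosleep` are real POSIX calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute timespec by ns nanoseconds, keeping tv_nsec normalized. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec++;
    }
}

/* Hypothetical stand-in for the thread's real work: counts invocations. */
static void count_tick(void *arg)
{
    (*(int *)arg)++;
}

/*
 * Hypothetical helper: call work(arg) once per period_ns nanoseconds for
 * `iterations` cycles. Between cycles the thread sleeps in the kernel until
 * an absolute deadline, so it does not occupy the CPU while waiting.
 * Returns 0 on success or an errno-style error code.
 */
static int run_periodic(long period_ns, int iterations,
                        void (*work)(void *), void *arg)
{
    struct timespec next;
    int rc;

    assert(period_ns > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return errno;

    for (int i = 0; i < iterations; i++) {
        work(arg);
        timespec_add_ns(&next, period_ns);
        /* Sleep until the absolute deadline; retry if a signal interrupts. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

A thread body could call, for example, `run_periodic(1000000L, n, count_tick, &ticks)` for a 1 ms period. Using an absolute deadline (`TIMER_ABSTIME`) avoids accumulating drift, and while the thread is blocked, lower-priority tasks on the same core get a chance to run. Whether that is enough to avoid the stall on this particular system is untested here.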

Answer 2

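I got something strikingly similar to this during boot: the system would hang, pressing any key (even Num Lock) would unhang it, and it would hang again a few seconds later. I had to do this 5-7 times on every boot!

The culprit was a BIOS setting: AMD C1E Support was set to Enabled, and setting it to Auto or Disabled (both tested) solved the problem for me! No more stalls/hangs!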

scheduling

linux-kernel