Kernel stalls when accessing serial device on FPGA

I have two UART devices on an FPGA that are exposed to Linux on an Altera Cyclone V SoC. I've modified the DTS to include these devices, and Linux picks them up at boot:

[    0.879942] (NULL device *): ttyAL0 at MMIO 0xff200400 (irq = 41, base_baud = 3125000) is a Altera UART
[    0.890050] (NULL device *): ttyAL1 at MMIO 0xff200420 (irq = 44, base_baud = 3125000) is a Altera UART

This yields ttyAL0 and ttyAL1 in /dev/. The devices also appear in the relevant device subdirectories under /sys/devices/soc/, and the driver symlink is present, e.g.:

lrwxrwxrwx    1 root     root             0 Jun 20 10:36 driver -> ../../../bus/platform/drivers/altera_uart
-rw-r--r--    1 root     root          4096 Jun 20 10:36 driver_override
-r--r--r--    1 root     root          4096 Jun 20 10:36 modalias
drwxr-xr-x    2 root     root             0 Jun 20 10:36 power
lrwxrwxrwx    1 root     root             0 Jun 20 10:36 subsystem -> ../../../bus/platform
-rw-r--r--    1 root     root          4096 Jun 20 10:36 uevent

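However, if I try to open the port, either programmatically or with cat or setserial, there is a pause of about 20 seconds before the kernel reports an RCU stall: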

[  202.242133] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=2102 jiffies, g=124, c=123, q=254)
[  202.252516] INFO: Stall ended before state dump start
[  223.252109] INFO: rcu_sched self-detected stall on CPU { 0}  (t=2100 jiffies g=125 c=124 q=229)
[  223.260843] Task dump for CPU 0:
[  223.264066] klogd           R running      0   954      1 0x00000002
[  223.270566] [<c0017984>] (unwind_backtrace) from [<c00137e0>] (show_stack+0x20/0x24)
[  223.278319] [<c00137e0>] (show_stack) from [<c004b6cc>] (sched_show_task+0xb0/0x104)
[  223.286045] [<c004b6cc>] (sched_show_task) from [<c004e34c>] (dump_cpu_task+0x48/0x4c)
[  223.293941] [<c004e34c>] (dump_cpu_task) from [<c006ae60>] (rcu_dump_cpu_stacks+0xa0/0xcc)
[  223.302188] [<c006ae60>] (rcu_dump_cpu_stacks) from [<c006e520>] (rcu_check_callbacks+0x488/0x790)
[  223.311137] [<c006e520>] (rcu_check_callbacks) from [<c0072db0>] (update_process_times+0x50/0x70)
[  223.319982] [<c0072db0>] (update_process_times) from [<c0083258>] (tick_sched_timer+0x78/0x27c)
[  223.328656] [<c0083258>] (tick_sched_timer) from [<c00735f4>] (__run_hrtimer+0x90/0x1bc)
[  223.336719] [<c00735f4>] (__run_hrtimer) from [<c0073ef4>] (hrtimer_interrupt+0x140/0x31c)
[  223.344955] [<c0073ef4>] (hrtimer_interrupt) from [<c0016b58>] (twd_handler+0x40/0x50)
[  223.352867] [<c0016b58>] (twd_handler) from [<c00669bc>] (handle_percpu_devid_irq+0x90/0x124)
[  223.361364] [<c00669bc>] (handle_percpu_devid_irq) from [<c0062684>] (generic_handle_irq+0x3c/0x4c)
[  223.370377] [<c0062684>] (generic_handle_irq) from [<c0062948>] (__handle_domain_irq+0x6c/0xb4)
[  223.379042] [<c0062948>] (__handle_domain_irq) from [<c00086b0>] (gic_handle_irq+0x34/0x6c)
[  223.387362] [<c00086b0>] (gic_handle_irq) from [<c0014380>] (__irq_svc+0x40/0x54)
[  223.394811] Exception stack(0xded29cf8 to 0xded29d40)
[  223.399842] 9ce0:                                                       00000001 c06cb200
[  223.407986] 9d00: 00000000 00000000 c0687b34 00000000 00000082 00000001 df418800 c06c416c
[  223.416128] 9d20: ded28000 ded29d9c 00000000 ded29d40 c06cb200 c0029330 200f0113 ffffffff
[  223.424285] [<c0014380>] (__irq_svc) from [<c0029330>] (__do_softirq+0xc4/0x2f0)
[  223.431656] [<c0029330>] (__do_softirq) from [<c00297f8>] (irq_exit+0x88/0xc0)
[  223.438851] [<c00297f8>] (irq_exit) from [<c006294c>] (__handle_domain_irq+0x70/0xb4)
[  223.446649] [<c006294c>] (__handle_domain_irq) from [<c00086b0>] (gic_handle_irq+0x34/0x6c)
[  223.454965] [<c00086b0>] (gic_handle_irq) from [<c0014380>] (__irq_svc+0x40/0x54)
[  223.462412] Exception stack(0xded29e08 to 0xded29e50)
[  223.467443] 9e00:                   dfbd3540 df782ac0 00000000 0000996f df59d6c0 dfbd3540
[  223.475584] 9e20: c0695e20 00000000 df59c1c0 df59c540 ded28030 ded29e6c ded29e70 ded29e50
[  223.483725] 9e40: c047bad0 c004756c 600f0013 ffffffff
[  223.488762] [<c0014380>] (__irq_svc) from [<c004756c>] (finish_task_switch+0x78/0x11c)
[  223.496661] [<c004756c>] (finish_task_switch) from [<c047bad0>] (__schedule+0x230/0x5f4)
[  223.504726] [<c047bad0>] (__schedule) from [<c047bed4>] (schedule+0x40/0x8c)
[  223.511746] [<c047bed4>] (schedule) from [<c0061a58>] (do_syslog+0x51c/0x5a8)
[  223.518855] [<c0061a58>] (do_syslog) from [<c0061b00>] (SyS_syslog+0x1c/0x20)
[  223.525968] [<c0061b00>] (SyS_syslog) from [<c000f820>] (ret_fast_syscall+0x0/0x30)

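I don't know why this happens, but I've noticed two interesting (i.e. wrong) things about how Linux sees my devices. The first is that their IRQs, even though they are reported correctly at boot and during any bind/unbind operations, are not listed in /proc/interrupts (they would show up as ff200400.serial2 and ff200420.serial3):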

           CPU0       CPU1
 29:      47565      47091       GIC  29  twd
 74:          0          0       GIC  74  0009
 75:          0          0       GIC  75  000A
 76:          0          0       GIC  76  000A
 77:          0          0       GIC  77  0004
 78:          0          0       GIC  78  0003
 79:          0          0       GIC  79  0006
 80:          0          0       GIC  80  0011
 81:          0          0       GIC  81  0011
 82:          0          0       GIC  82  0010
171:      10554          0       GIC 171  dw-mci
186:          0          0       GIC 186  dw_spi65535
190:          0          0       GIC 190  ffc04000.i2c
191:          0          0       GIC 191  ffc05000.i2c
192:          0          0       GIC 192  ffc06000.i2c
193:          0          0       GIC 193  ffc07000.i2c
194:        465          0       GIC 194  serial
199:          0          0       GIC 199  timer0
207:          0          0       GIC 207  fpga-mgr
IPI0:          0          0  CPU wakeup interrupts
IPI1:          0          0  Timer broadcast interrupts
IPI2:        591       3015  Rescheduling interrupts
IPI3:          0          0  Function call interrupts
IPI4:          1          5  Single function call interrupts
IPI5:          0          0  CPU stop interrupts
IPI6:          0          0  IRQ work interrupts
IPI7:          0          0  completion interrupts
Err:          0

The other observation is that in /sys/class/tty, the ttyAL* entries link to virtual devices rather than physical ones:

...
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 tty8 -> ../../devices/virtual/tty/tty8
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 tty9 -> ../../devices/virtual/tty/tty9
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 ttyAL0 -> ../../devices/virtual/tty/ttyAL0
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 ttyAL1 -> ../../devices/virtual/tty/ttyAL1
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 ttyS0 -> ../../devices/soc/ffc02000.serial0/tty/ttyS0
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 ttyS1 -> ../../devices/soc/ffc03000.serial1/tty/ttyS1
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 ttyp0 -> ../../devices/virtual/tty/ttyp0
lrwxrwxrwx    1 root     root             0 Jun 20 10:49 ttyp1 -> ../../devices/virtual/tty/ttyp1
...

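You can see the other two physical devices, ttyS0 and ttyS1 (the 'real' UARTs on the ARM side of the SoC), and I would expect my devices to follow the same format. If you look back at the /sys/devices/soc/ device subdirectory listed above, you'll notice it has no corresponding tty subdirectory, which is probably part of the reason a virtual TTY is associated with the device.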

So my question is: why are my physical serial devices showing up as virtual devices, and is that the reason I'm getting kernel stalls?

In case I've missed something important in the DTS, here are the UARTs I added:

uart2: serial2@ff200400 {
    compatible = "altr,uart-1.0";
    reg = <0xff200400 0x20>;
    interrupts = <0 9 4>;
    clock-frequency = <50000000>;
    current-speed = <115200>;
};

uart3: serial3@ff200420 {
    compatible = "altr,uart-1.0";
    reg = <0xff200420 0x20>;
    interrupts = <0 12 4>;
    clock-frequency = <50000000>;
    current-speed = <115200>;
};

They are children of the soc node, which specifies the interrupt controller.

I finally found the problem, and judging by the RCU stall stack trace it isn't surprising: my IRQs were wrong.

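I don't fully understand the exact mechanism, since I'm not a firmware engineer, but the IRQs for the UART modules are offset by 40, so their IRQs are not 9 and 12 as I had thought, but 49 and 52. Updating the DTS to match made everything work as expected.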
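
For illustration, the corrected nodes would presumably look something like the fragment below. It is a sketch reconstructed from the DTS in the question with only the interrupt numbers changed (9 + 40 = 49 and 12 + 40 = 52), not a copy of the working file. As a sanity check on the old values, the boot log reported irq = 41 and 44, which matches SPI numbers 9 and 12, since the GIC numbers shared peripheral interrupts from 32 upward.

```dts
/*
 * Sketch only: the same two nodes as in the question, with the
 * interrupt numbers shifted by the offset of 40 described above.
 * All other properties are unchanged from the question.
 */
uart2: serial2@ff200400 {
    compatible = "altr,uart-1.0";
    reg = <0xff200400 0x20>;
    interrupts = <0 49 4>;      /* was <0 9 4>: 9 + 40 = 49 */
    clock-frequency = <50000000>;
    current-speed = <115200>;
};

uart3: serial3@ff200420 {
    compatible = "altr,uart-1.0";
    reg = <0xff200420 0x20>;
    interrupts = <0 52 4>;      /* was <0 12 4>: 12 + 40 = 52 */
    clock-frequency = <50000000>;
    current-speed = <115200>;
};
```

With the numbers corrected, I would expect the two UARTs to show up in /proc/interrupts as ff200400.serial2 and ff200420.serial3 once the ports are opened.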