Kernel stalls when accessing serial device on FPGA
I have two UART devices in the FPGA fabric of an Altera Cyclone V SoC that are exposed to Linux. I modified the DTS to include these devices, and Linux picks them up at boot:
[ 0.879942] (NULL device *): ttyAL0 at MMIO 0xff200400 (irq = 41, base_baud = 3125000) is a Altera UART
[ 0.890050] (NULL device *): ttyAL1 at MMIO 0xff200420 (irq = 44, base_baud = 3125000) is a Altera UART
This produces ttyAL0 and ttyAL1 in /dev/. The devices also appear in their respective device subdirectories under /sys/devices/soc/, and the driver symlink is present, for example:
lrwxrwxrwx 1 root root 0 Jun 20 10:36 driver -> ../../../bus/platform/drivers/altera_uart
-rw-r--r-- 1 root root 4096 Jun 20 10:36 driver_override
-r--r--r-- 1 root root 4096 Jun 20 10:36 modalias
drwxr-xr-x 2 root root 0 Jun 20 10:36 power
lrwxrwxrwx 1 root root 0 Jun 20 10:36 subsystem -> ../../../bus/platform
-rw-r--r-- 1 root root 4096 Jun 20 10:36 uevent
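However, if I try to open either port, whether programmatically or with cat or setserial, the system stalls for about 20 seconds before the RCU scheduler reports the stall: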
[ 202.242133] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=2102 jiffies, g=124, c=123, q=254)
[ 202.252516] INFO: Stall ended before state dump start
[ 223.252109] INFO: rcu_sched self-detected stall on CPU { 0} (t=2100 jiffies g=125 c=124 q=229)
[ 223.260843] Task dump for CPU 0:
[ 223.264066] klogd R running 0 954 1 0x00000002
[ 223.270566] [<c0017984>] (unwind_backtrace) from [<c00137e0>] (show_stack+0x20/0x24)
[ 223.278319] [<c00137e0>] (show_stack) from [<c004b6cc>] (sched_show_task+0xb0/0x104)
[ 223.286045] [<c004b6cc>] (sched_show_task) from [<c004e34c>] (dump_cpu_task+0x48/0x4c)
[ 223.293941] [<c004e34c>] (dump_cpu_task) from [<c006ae60>] (rcu_dump_cpu_stacks+0xa0/0xcc)
[ 223.302188] [<c006ae60>] (rcu_dump_cpu_stacks) from [<c006e520>] (rcu_check_callbacks+0x488/0x790)
[ 223.311137] [<c006e520>] (rcu_check_callbacks) from [<c0072db0>] (update_process_times+0x50/0x70)
[ 223.319982] [<c0072db0>] (update_process_times) from [<c0083258>] (tick_sched_timer+0x78/0x27c)
[ 223.328656] [<c0083258>] (tick_sched_timer) from [<c00735f4>] (__run_hrtimer+0x90/0x1bc)
[ 223.336719] [<c00735f4>] (__run_hrtimer) from [<c0073ef4>] (hrtimer_interrupt+0x140/0x31c)
[ 223.344955] [<c0073ef4>] (hrtimer_interrupt) from [<c0016b58>] (twd_handler+0x40/0x50)
[ 223.352867] [<c0016b58>] (twd_handler) from [<c00669bc>] (handle_percpu_devid_irq+0x90/0x124)
[ 223.361364] [<c00669bc>] (handle_percpu_devid_irq) from [<c0062684>] (generic_handle_irq+0x3c/0x4c)
[ 223.370377] [<c0062684>] (generic_handle_irq) from [<c0062948>] (__handle_domain_irq+0x6c/0xb4)
[ 223.379042] [<c0062948>] (__handle_domain_irq) from [<c00086b0>] (gic_handle_irq+0x34/0x6c)
[ 223.387362] [<c00086b0>] (gic_handle_irq) from [<c0014380>] (__irq_svc+0x40/0x54)
[ 223.394811] Exception stack(0xded29cf8 to 0xded29d40)
[ 223.399842] 9ce0: 00000001 c06cb200
[ 223.407986] 9d00: 00000000 00000000 c0687b34 00000000 00000082 00000001 df418800 c06c416c
[ 223.416128] 9d20: ded28000 ded29d9c 00000000 ded29d40 c06cb200 c0029330 200f0113 ffffffff
[ 223.424285] [<c0014380>] (__irq_svc) from [<c0029330>] (__do_softirq+0xc4/0x2f0)
[ 223.431656] [<c0029330>] (__do_softirq) from [<c00297f8>] (irq_exit+0x88/0xc0)
[ 223.438851] [<c00297f8>] (irq_exit) from [<c006294c>] (__handle_domain_irq+0x70/0xb4)
[ 223.446649] [<c006294c>] (__handle_domain_irq) from [<c00086b0>] (gic_handle_irq+0x34/0x6c)
[ 223.454965] [<c00086b0>] (gic_handle_irq) from [<c0014380>] (__irq_svc+0x40/0x54)
[ 223.462412] Exception stack(0xded29e08 to 0xded29e50)
[ 223.467443] 9e00: dfbd3540 df782ac0 00000000 0000996f df59d6c0 dfbd3540
[ 223.475584] 9e20: c0695e20 00000000 df59c1c0 df59c540 ded28030 ded29e6c ded29e70 ded29e50
[ 223.483725] 9e40: c047bad0 c004756c 600f0013 ffffffff
[ 223.488762] [<c0014380>] (__irq_svc) from [<c004756c>] (finish_task_switch+0x78/0x11c)
[ 223.496661] [<c004756c>] (finish_task_switch) from [<c047bad0>] (__schedule+0x230/0x5f4)
[ 223.504726] [<c047bad0>] (__schedule) from [<c047bed4>] (schedule+0x40/0x8c)
[ 223.511746] [<c047bed4>] (schedule) from [<c0061a58>] (do_syslog+0x51c/0x5a8)
[ 223.518855] [<c0061a58>] (do_syslog) from [<c0061b00>] (SyS_syslog+0x1c/0x20)
[ 223.525968] [<c0061b00>] (SyS_syslog) from [<c000f820>] (ret_fast_syscall+0x0/0x30)
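I don't know why this happens, but I have noticed two interesting (i.e. wrong) things about how Linux sees my devices. The first is that their IRQs, even though they are reported correctly at boot and on any bind/unbind operation, are not listed in /proc/interrupts (they would show up as ff200400.serial2 and ff200420.serial3):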
CPU0 CPU1
29: 47565 47091 GIC 29 twd
74: 0 0 GIC 74 0009
75: 0 0 GIC 75 000A
76: 0 0 GIC 76 000A
77: 0 0 GIC 77 0004
78: 0 0 GIC 78 0003
79: 0 0 GIC 79 0006
80: 0 0 GIC 80 0011
81: 0 0 GIC 81 0011
82: 0 0 GIC 82 0010
171: 10554 0 GIC 171 dw-mci
186: 0 0 GIC 186 dw_spi65535
190: 0 0 GIC 190 ffc04000.i2c
191: 0 0 GIC 191 ffc05000.i2c
192: 0 0 GIC 192 ffc06000.i2c
193: 0 0 GIC 193 ffc07000.i2c
194: 465 0 GIC 194 serial
199: 0 0 GIC 199 timer0
207: 0 0 GIC 207 fpga-mgr
IPI0: 0 0 CPU wakeup interrupts
IPI1: 0 0 Timer broadcast interrupts
IPI2: 591 3015 Rescheduling interrupts
IPI3: 0 0 Function call interrupts
IPI4: 1 5 Single function call interrupts
IPI5: 0 0 CPU stop interrupts
IPI6: 0 0 IRQ work interrupts
IPI7: 0 0 completion interrupts
Err: 0
The other observation is that in /sys/class/tty the ttyAL* entries link to virtual devices rather than physical ones:
...
lrwxrwxrwx 1 root root 0 Jun 20 10:49 tty8 -> ../../devices/virtual/tty/tty8
lrwxrwxrwx 1 root root 0 Jun 20 10:49 tty9 -> ../../devices/virtual/tty/tty9
lrwxrwxrwx 1 root root 0 Jun 20 10:49 ttyAL0 -> ../../devices/virtual/tty/ttyAL0
lrwxrwxrwx 1 root root 0 Jun 20 10:49 ttyAL1 -> ../../devices/virtual/tty/ttyAL1
lrwxrwxrwx 1 root root 0 Jun 20 10:49 ttyS0 -> ../../devices/soc/ffc02000.serial0/tty/ttyS0
lrwxrwxrwx 1 root root 0 Jun 20 10:49 ttyS1 -> ../../devices/soc/ffc03000.serial1/tty/ttyS1
lrwxrwxrwx 1 root root 0 Jun 20 10:49 ttyp0 -> ../../devices/virtual/tty/ttyp0
lrwxrwxrwx 1 root root 0 Jun 20 10:49 ttyp1 -> ../../devices/virtual/tty/ttyp1
...
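You can see the other two physical devices, ttyS0 and ttyS1 (the 'real' UARTs on the ARM side of the SoC), which follow the layout I would expect my devices to have. If you refer back to the /sys/devices/soc/ device subdirectory listed above, you will notice that it has no corresponding tty subdirectory, which is probably part of the reason a virtual TTY is associated with the device.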
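So my question is: why do my physical serial devices show up as virtual devices, and is that the cause of the kernel stall?
In case I have missed something important in the DTS, here are the UARTs I added: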
uart2: serial2@ff200400 {
    compatible = "altr,uart-1.0";
    reg = <0xff200400 0x20>;
    interrupts = <0 9 4>;
    clock-frequency = <50000000>;
    current-speed = <115200>;
};
uart3: serial3@ff200420 {
    compatible = "altr,uart-1.0";
    reg = <0xff200420 0x20>;
    interrupts = <0 12 4>;
    clock-frequency = <50000000>;
    current-speed = <115200>;
};
They are children of the soc node, which specifies the interrupt controller.
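I finally found the problem, and given the RCU scheduler stack trace it is not surprising: my IRQs were wrong.
I don't fully understand the exact mechanism, as I am not a firmware engineer, but the UART modules' IRQs are offset by 40, so their IRQ numbers are not 9 and 12 as I had thought, but 49 and 52. Updating the DTS to match made everything work as expected.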
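For reference, here is a sketch of what the corrected nodes would look like, assuming the interrupt number is the only property that needs to change (FPGA IRQ index plus 40); everything else is copied unchanged from the question. As I understand it, the offset is consistent with the FPGA-to-HPS interrupt lines starting at GIC interrupt ID 72 on the Cyclone V, which is SPI number 40 in the three-cell GIC binding (the SPI number is the GIC ID minus 32), but treat that explanation as an assumption and check it against your own hardware design.

```dts
/* Sketch only: same nodes as in the question, with the SPI numbers
 * shifted by the offset of 40 (9 -> 49, 12 -> 52). */
uart2: serial2@ff200400 {
    compatible = "altr,uart-1.0";
    reg = <0xff200400 0x20>;
    interrupts = <0 49 4>;      /* was <0 9 4> */
    clock-frequency = <50000000>;
    current-speed = <115200>;
};
uart3: serial3@ff200420 {
    compatible = "altr,uart-1.0";
    reg = <0xff200420 0x20>;
    interrupts = <0 52 4>;      /* was <0 12 4> */
    clock-frequency = <50000000>;
    current-speed = <115200>;
};
```

In the interrupts property the first cell (0) selects a shared peripheral interrupt, the second is the SPI number, and the third (4) marks the line as level-triggered, active high.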