PROBLEM: deadlock when imx-sdma logs "restart cyclic channel"

From: Tim van der Staaij | Zign
Date: Fri Jul 14 2023 - 10:36:51 EST


Hello,

I use an i.MX6ULL UART with DMA enabled. Sometimes when a *lot* of data is received on this UART, the kernel hangs, printing this to the serial console:

[ 83.339872] imx-sdma 20ec000.sdma: restart cyclic channel 5
[ 83.340296]
[ 83.340307] ======================================================
[ 83.340312] WARNING: possible circular locking dependency detected
[ 83.340317] 5.19.14-yocto-standard #1 Not tainted
[ 83.340326] ------------------------------------------------------
[ 83.340329] ssh-keygen/236 is trying to acquire lock:
[ 83.340336] c1718ad8 (console_owner){-...}-{0:0}, at: console_emit_next_record.constprop.0+0x120/0x344
[ 83.340388]
[ 83.340388] but task is already holding lock:
[ 83.340392] c391481c (&vc->lock){-.-.}-{2:2}, at: sdma_int_handler+0xb4/0x34c [imx_sdma]
[ 83.340442]
[ 83.340442] which lock already depends on the new lock.
[ 83.340442]
[ 83.340447]
[ 83.340447] the existing dependency chain (in reverse order) is:
[ 83.340450]
[ 83.340450] -> #2 (&vc->lock){-.-.}-{2:2}:
[ 83.340469] sdma_prep_dma_cyclic+0x194/0x1f4 [imx_sdma]
[ 83.340495] imx_uart_startup+0x494/0x624
[ 83.340508] uart_port_startup+0x128/0x290
[ 83.340519] uart_port_activate+0x58/0xc0
[ 83.340530] tty_port_open+0x80/0xd0
[ 83.340543] uart_open+0x18/0x20
[ 83.340557] tty_open+0x128/0x678
[ 83.340570] chrdev_open+0xc0/0x214
[ 83.340582] do_dentry_open+0x1a8/0x3fc
[ 83.340596] path_openat+0xb48/0xe68
[ 83.340606] do_filp_open+0x5c/0xd4
[ 83.340618] do_sys_openat2+0xb8/0x188
[ 83.340631] sys_openat+0x8c/0xd8
[ 83.340642] ret_fast_syscall+0x0/0x1c
[ 83.340655] 0xbedf9078
[ 83.340662]
[ 83.340662] -> #1 (&port_lock_key){-.-.}-{2:2}:
[ 83.340681] imx_uart_console_write+0x16c/0x1a8
[ 83.340692] console_emit_next_record.constprop.0+0x18c/0x344
[ 83.340707] console_unlock+0x10c/0x270
[ 83.340719] register_console+0x154/0x2e8
[ 83.340732] uart_add_one_port+0x4fc/0x564
[ 83.340745] imx_uart_probe+0x4c0/0x71c
[ 83.340755] platform_probe+0x58/0xb0
[ 83.340770] really_probe.part.0+0x9c/0x2b0
[ 83.340783] __driver_probe_device+0xa0/0x138
[ 83.340793] driver_probe_device+0x30/0x10c
[ 83.340805] __driver_attach+0x8c/0x178
[ 83.340815] bus_for_each_dev+0x80/0xcc
[ 83.340832] bus_add_driver+0x154/0x1e8
[ 83.340843] driver_register+0x88/0x11c
[ 83.340854] imx_uart_init+0x20/0x40
[ 83.340866] do_one_initcall+0x7c/0x43c
[ 83.340880] kernel_init_freeable+0x17c/0x228
[ 83.340892] kernel_init+0x14/0x140
[ 83.340908] ret_from_fork+0x14/0x28
[ 83.340918] 0x0
[ 83.340925]
[ 83.340925] -> #0 (console_owner){-...}-{0:0}:
[ 83.340946] lock_acquire.part.0+0xb0/0x250
[ 83.340959] console_emit_next_record.constprop.0+0x168/0x344
[ 83.340975] console_unlock+0x10c/0x270
[ 83.340987] vprintk_emit+0x114/0x358
[ 83.340999] dev_vprintk_emit+0x114/0x150
[ 83.341016] dev_printk_emit+0x2c/0x5c
[ 83.341030] __dev_printk+0x4c/0x6c
[ 83.341042] _dev_warn+0x3c/0x70
[ 83.341057] sdma_int_handler+0x308/0x34c [imx_sdma]
[ 83.341085] __handle_irq_event_percpu+0x9c/0x338
[ 83.341096] handle_irq_event+0x38/0xd0
[ 83.341107] handle_fasteoi_irq+0x98/0x204
[ 83.341120] handle_irq_desc+0x1c/0x2c
[ 83.341130] gic_handle_irq+0x6c/0x90
[ 83.341146] generic_handle_arch_irq+0x2c/0x64
[ 83.341159] call_with_stack+0x18/0x20
[ 83.341173] __irq_usr+0x84/0xa0
[ 83.341184] 0xb6d35e48
[ 83.341191]
[ 83.341191] other info that might help us debug this:
[ 83.341191]
[ 83.341196] Chain exists of:
[ 83.341196] console_owner --> &port_lock_key --> &vc->lock
[ 83.341196]
[ 83.341221] Possible unsafe locking scenario:
[ 83.341221]
[ 83.341225] CPU0 CPU1
[ 83.341228] ---- ----
[ 83.341231] lock(&vc->lock);
[ 83.341239] lock(&port_lock_key);
[ 83.341248] lock(&vc->lock);
[ 83.341257] lock(console_owner);
[ 83.341266]
[ 83.341266] *** DEADLOCK ***
[ 83.341266]
[ 83.341270] 2 locks held by ssh-keygen/236:
[ 83.341277] #0: c391481c (&vc->lock){-.-.}-{2:2}, at: sdma_int_handler+0xb4/0x34c [imx_sdma]
[ 83.341323] #1: c17189d8 (console_lock){+.+.}-{0:0}, at: dev_vprintk_emit+0x114/0x150
[ 83.341359]
[ 83.341359] stack backtrace:
[ 83.341365] CPU: 0 PID: 236 Comm: ssh-keygen Not tainted 5.19.14-yocto-standard #1
[ 83.341376] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
[ 83.341387] unwind_backtrace from show_stack+0x10/0x14
[ 83.341407] show_stack from dump_stack_lvl+0x60/0x90
[ 83.341427] dump_stack_lvl from check_noncircular+0x118/0x188
[ 83.341446] check_noncircular from __lock_acquire+0x1484/0x2b8c
[ 83.341463] __lock_acquire from lock_acquire.part.0+0xb0/0x250
[ 83.341481] lock_acquire.part.0 from console_emit_next_record.constprop.0+0x168/0x344
[ 83.341503] console_emit_next_record.constprop.0 from console_unlock+0x10c/0x270
[ 83.341524] console_unlock from vprintk_emit+0x114/0x358
[ 83.341544] vprintk_emit from dev_vprintk_emit+0x114/0x150
[ 83.341567] dev_vprintk_emit from dev_printk_emit+0x2c/0x5c
[ 83.341590] dev_printk_emit from __dev_printk+0x4c/0x6c
[ 83.341610] __dev_printk from _dev_warn+0x3c/0x70
[ 83.341633] _dev_warn from sdma_int_handler+0x308/0x34c [imx_sdma]
[ 83.341672] sdma_int_handler [imx_sdma] from __handle_irq_event_percpu+0x9c/0x338
[ 83.341705] __handle_irq_event_percpu from handle_irq_event+0x38/0xd0
[ 83.341720] handle_irq_event from handle_fasteoi_irq+0x98/0x204
[ 83.341740] handle_fasteoi_irq from handle_irq_desc+0x1c/0x2c
[ 83.341755] handle_irq_desc from gic_handle_irq+0x6c/0x90
[ 83.341775] gic_handle_irq from generic_handle_arch_irq+0x2c/0x64
[ 83.341798] generic_handle_arch_irq from call_with_stack+0x18/0x20
[ 83.341821] call_with_stack from __irq_usr+0x84/0xa0
[ 83.341838] Exception stack(0xe0dd5fb0 to 0xe0dd5ff8)
[ 83.341850] 5fa0: 00000000 00000007 015ee9dc 00000000
[ 83.341864] 5fc0: 00000080 00000000 00000000 00000010 00000040 00000002 000000c0 37ea2aa2
[ 83.341874] 5fe0: ffffffff bed21458 00000000 b6d35e48 60030030 ffffffff

I noticed that sdma_int_handler calls sdma_update_channel_loop while &sdmac->vc is locked. However, sdma_update_channel_loop can log "restart cyclic channel" in some situations, which will eventually attempt to acquire the lock of the console owner, which I suppose is the same lock.

Steps to reproduce:

Unfortunately I didn't find a way to consistently reproduce this issue. I can only say that I observed it multiple times in different boots, but only when large amounts of data are received.

# cat /proc/version
Linux version 5.19.14-yocto-standard (oe-user@oe-host) (arm-poky-linux-gnueabi-gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0.20220819) #1 SMP Thu Oct 6 18:16:52 UTC 2022

Note: I didn't test on a newer kernel because the diff with this version shows no relevant changes.

# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 64.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt fpd32 lpae
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5
Hardware : Freescale i.MX6 Ultralite (Device Tree)

Relevant device tree additions:

/* Serial console UART */
&uart1 {
pinctrl-names = "default";
pinctrl-0 = <&pinctrl_uart1>;
status = "okay";
};

/* RS485 UART with DMA */
&uart5 {
pinctrl-names = "default";
pinctrl-0 = <&pinctrl_uart5>;
dmas = <&sdma 33 4 1>, <&sdma 34 4 2>;
dma-names = "rx", "tx";
fsl,uart-has-rtscts;
linux,rs485-enabled-at-boot-time;
status = "okay";
};

Kind regards,

Tim