Re: BUG: BISECTED: in squashfs_xz_uncompress() (Was: RCU stalls in squashfs_readahead())

From: Mirsad Goran Todorovac
Date: Tue Dec 06 2022 - 15:36:10 EST


On 24. 11. 2022. 20:32, Phillip Lougher wrote:
On 24/11/2022 18:04, Mirsad Goran Todorovac wrote:
On 24. 11. 2022. 18:19, Paul E. McKenney wrote:
On Thu, Nov 24, 2022 at 06:06:13PM +0100, Mirsad Goran Todorovac wrote:
On 23. 11. 2022. 20:09, Paul E. McKenney wrote:

If you build with (say) CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=200, does
this still happen?

BTW, you don't need to rebuild the kernel to change those parameters; they're
module parameters, so can be modified on the kernel command line (if needed
during boot) and sysfs (if only needed after boot).

For sysfs the syntax is:
#!/bin/bash
# set rcu timeouts to specified values
echo 60 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
echo 21000 > /sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout
echo 600000 > /sys/module/rcupdate/parameters/rcu_task_stall_timeout
grep -Hs . /sys/module/rcupdate/parameters/rcu_*_timeout
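
For the boot-time case, the same module parameters take the "rcupdate." prefix on the kernel command line. A minimal sketch that only prints the string to append to the boot loader entry; the helper name rcu_boot_args is made up for illustration, and the values are the ones from the sysfs example above:

```shell
#!/bin/bash
# rcu_boot_args is a hypothetical helper, not an existing tool.
# $1 = rcu_cpu_stall_timeout in seconds
# $2 = rcu_exp_cpu_stall_timeout in milliseconds
rcu_boot_args() {
    printf 'rcupdate.rcu_cpu_stall_timeout=%s rcupdate.rcu_exp_cpu_stall_timeout=%s\n' "$1" "$2"
}

# Print the parameters to append to the kernel command line.
rcu_boot_args 60 21000
```

How the string gets onto the command line (GRUB, systemd-boot, and so on) depends on the distribution, so that step is left out here.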

Excellent point, thank you!

I hope that this makes Mirsad's life easier, perhaps featuring less time
waiting for kernel builds and reboots.  ;-)

Unfortunately, the first stall and NMI occur before any system script or write to /sys/module/rcupdate/parameters/*
could be executed, at second 14 of the boot process:

[   14.320045] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-.... } 6 jiffies s: 105 root: 0x80/.
[   14.320064] rcu: blocking rcu_node structures (internal RCU debug):

...

Probably something sensible should be set in the case of a KASAN build. This example of a stall
apparently has nothing to do with squashfs_readahead().

Can't have everything, I guess!

How about building your kernel with CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=200?
Again, mainline defaults to 21000.

Did just that, and so far there is no modprobe stall in second 14 of boot at least. Looks good.
Probably it is too early to say anything in general before more uptime and stress load.

BTW, the 20 for CONFIG_RCU_EXP_CPU_STALL_TIMEOUT wasn't my invention; it comes from the generic
Ubuntu stock kernel (but without KASAN or KMEMLEAK config options):

# grep STALL /boot/config-5.19.5-051905-generic
CONFIG_RCU_STALL_COMMON=y
# CONFIG_HEADERS_INSTALL is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=20
#

That has been raised as a bug, and a fix has been committed.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1991951
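
A quick way to spot this setting before reusing a distribution config is sketched below. The helper name exp_stall_timeout_of and the 1000 ms warning threshold are arbitrary choices for illustration, and the throwaway file just holds the config lines quoted above:

```shell
#!/bin/bash
# exp_stall_timeout_of is a hypothetical helper: print the value of
# CONFIG_RCU_EXP_CPU_STALL_TIMEOUT found in the config file given as $1.
exp_stall_timeout_of() {
    sed -n 's/^CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=//p' "$1"
}

# Throwaway file with the lines from the grep output above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
CONFIG_RCU_STALL_COMMON=y
# CONFIG_HEADERS_INSTALL is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=20
EOF

val=$(exp_stall_timeout_of "$cfg")
# The 1000 ms threshold is an assumption, not a kernel recommendation.
if [ "$val" -gt 0 ] && [ "$val" -lt 1000 ]; then
    echo "expedited stall timeout is only ${val} ms"
fi
rm -f "$cfg"
```

Pointed at a real file such as /boot/config-$(uname -r), the same function shows what a copied config would carry into a "make olddefconfig" run.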

P.S.

As for the comment that I am thrashing my systems, I am now testing a 6.1-rc8 build with the
MG-LRU kernel option enabled, and it functions much better, with no multimedia lags or chirps,
even with only 130/8192 MiB free and 5/10 GiB in swap area.

I am running basically the same load of simultaneously opened Firefox, Chrome and Thunderbird
windows.

However, I have set CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0.

The conclusion is that squashfs isn't blocking, but 6 to 8 jiffies were not enough to
complete the operation, so other CPUs issued NMIs. With a longer timeout, it is evident that
it was a longer operation due to the KASAN build and not a lockup.
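
To put the numbers side by side, here is a rough millisecond-to-jiffies conversion. HZ=250 is an assumption about this build (check CONFIG_HZ in your own config), the helper name ms_to_jiffies is made up, and rounding up is a choice made for this sketch:

```shell
#!/bin/bash
# ms_to_jiffies is a hypothetical helper.
# $1 = timeout in milliseconds, $2 = HZ (ticks per second); rounds up.
ms_to_jiffies() {
    echo $(( ($1 * $2 + 999) / 1000 ))
}

ms_to_jiffies 20 250      # prints 5
ms_to_jiffies 21000 250   # prints 5250
```

Under that assumption the 20 ms setting is 5 jiffies, which fits the stalls being reported at 6 jiffies in the log above, while 21000 ms leaves about three orders of magnitude more room.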

So I think I have to apologise for having wasted so much of your time with a false alarm.

To summarise, the culprit was obviously the CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=20 setting
from the stock Ubuntu mainline kernel config, which I unsuspectingly copied into my build
before running the recommended "make olddefconfig".

Thanks,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu
--
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
The European Union