Re: BUG: BISECTED: in squashfs_xz_uncompress() (Was: RCU stalls in squashfs_readahead())

From: Paul E. McKenney
Date: Wed Nov 23 2022 - 14:09:39 EST


On Wed, Nov 23, 2022 at 04:08:03PM +0000, Elliott, Robert (Servers) wrote:
> > -----Original Message-----
> > From: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Sent: Wednesday, November 23, 2022 8:44 AM
> > Subject: Re: BUG: BISECTED: in squashfs_xz_uncompress() (Was: RCU stalls
> > in squashfs_readahead())
> >
> > On Wed, Nov 23, 2022 at 10:14:07AM +0100, Mirsad Goran Todorovac wrote:
> ...
>
> > > commit 1045a06724f322ed61f1ffb994427c7bdbe64647
> > > Author: Christoph Hellwig <hch@xxxxxx>
> > > Date: Wed Jun 29 17:01:02 2022 +0200
> > >
> > > remove CONFIG_ANDROID
> >
> > But your CONFIG_RCU_EXP_CPU_STALL_TIMEOUT seems to be about 20.
> > As in 20 milliseconds.
> >
> > Is that intentional?
> ...
> > > > And one approach is to take a less aggressive RCU CPU stall timeout,
> > > > say reducing from 21 seconds to (say) 15 seconds instead of all the
> > > > way down to 20 milliseconds. This could allow you to ease into the
> > > > latency-reduction work.
> > > >
> > > > Alternatively, consider that response time is a property of the
> > > > entire system plus the environment that it runs in. So I suspect that
> > > > the Android folks are accompanying that 20-millisecond timeout with
> > > > some restrictions on what the on-phone workloads are permitted to do.
> > > > Maybe ask the Android guys what those restrictions are and loosen them
> > > > slightly, again allowing you to ease into the latency-reduction work.
>
> Since there is no longer a kernel default of 20 ms for Android since
> commit 1045a06724f3 ("remove CONFIG_ANDROID"), it might be appropriate
> to drop that from stallwarn.rst. Something like:

Fair point, except that the Android guys asked for this. In fact, the
Android guys supplied the patches. I like the traceability.

> diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst
> index e38c587067fc..d86a8b47504f 100644
> --- a/Documentation/RCU/stallwarn.rst
> +++ b/Documentation/RCU/stallwarn.rst
> @@ -168,10 +168,11 @@ CONFIG_RCU_EXP_CPU_STALL_TIMEOUT
> Same as the CONFIG_RCU_CPU_STALL_TIMEOUT parameter but only for
> the expedited grace period. This parameter defines the period
> of time that RCU will wait from the beginning of an expedited
> - grace period until it issues an RCU CPU stall warning. This time
> - period is normally 20 milliseconds on Android devices. A zero
> - value causes the CONFIG_RCU_CPU_STALL_TIMEOUT value to be used,
> - after conversion to milliseconds.
> + grace period until it issues an RCU CPU stall warning.
> +
> + A zero value causes the CONFIG_RCU_CPU_STALL_TIMEOUT value to be
> + used, after conversion to milliseconds, limited to a maximum of
> + 21 seconds.
>
> This configuration parameter may be changed at runtime via the
> /sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout, however
> diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
> index 1b0c41d490f0..4477eeb8a54f 100644
> --- a/kernel/rcu/Kconfig.debug
> +++ b/kernel/rcu/Kconfig.debug
> @@ -93,7 +93,7 @@ config RCU_EXP_CPU_STALL_TIMEOUT
> If the RCU grace period persists, additional CPU stall warnings
> are printed at more widely spaced intervals. A value of zero
> says to use the RCU_CPU_STALL_TIMEOUT value converted from
> - seconds to milliseconds.
> + seconds to milliseconds, limited to a maximum of 21 seconds.
>
> > > > But please keep in mind that cond_resched() helps only in kernels
> > > > built with CONFIG_PREEMPTION=n.
>
> I think they also help with
> CONFIG_PREEMPT_VOLUNTARY=y
> which is used by Fedora and RHEL. For example, adding more of those in
> the crypto tcrypt test module eliminated some RCU stalls when running
> speed tests (that send lots of data through the algorithms).
>
> > > I have bad news that 6.1-r6 is still affected with
> > > squashfs_xz_uncompress bug, despite having both of your fixes
> > > (as visible in above command's output -- double checked):
> > >
> > > [ 91.065659] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 3-.... } 6 jiffies s: 621 root: 0x8/.
> >
> > If you build with (say) CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=200, does
> > this still happen?
>
> BTW, you don't need to rebuild the kernel to change those parameters; they're
> module parameters, so can be modified on the kernel command line (if needed
> during boot) and sysfs (if only needed after boot).
>
> For sysfs the syntax is:
> #!/bin/bash
> # set rcu timeouts to specified values
> echo 60 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
> echo 21000 > /sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout
> echo 600000 > /sys/module/rcupdate/parameters/rcu_task_stall_timeout
> grep -Hs . /sys/module/rcupdate/parameters/rcu_*_timeout

Excellent point, thank you!

I hope that this makes Mirsad's life easier, perhaps featuring less time
waiting for kernel builds and reboots. ;-)

Thanx, Paul