Re: BUG: in squashfs_xz_uncompress() (Was: RCU stalls in squashfs_readahead())

From: Mirsad Todorovac
Date: Tue Dec 20 2022 - 05:43:36 EST


On 11/18/22 17:51, Elliott, Robert (Servers) wrote:


-----Original Message-----
From: Phillip Lougher <phillip@xxxxxxxxxxxxxxx>
Sent: Friday, November 18, 2022 12:11 AM
To: Mirsad Goran Todorovac <mirsad.todorovac@xxxxxxxxxxxx>;
LKML <linux-kernel@xxxxxxxxxxxxxxx>; Paul E. McKenney <paulmck@xxxxxxxxxx>
Cc: phillip.lougher@xxxxxxxxx; Thorsten Leemhuis
<regressions@xxxxxxxxxxxxx>
Subject: Re: BUG: in squashfs_xz_uncompress() (Was: RCU stalls in
squashfs_readahead())

On 17/11/2022 23:05, Mirsad Goran Todorovac wrote:
Hi,

While trying to bisect, I've found another bug that predated the
introduction of squashfs_readahead(), but it has
a common denominator in squashfs_decompress() and
squashfs_xz_uncompress().

Wrong, the stall is happening in the XZ decompressor library, which
is *not* in Squashfs.

This reported stall in the decompressor code is likely a symptom of you
deliberately thrashing your system. When the system thrashes everything
starts to happen very slowly, and the system will spend a lot of
its time doing page I/O, and the CPU will spend a lot of time in
any CPU intensive code like the XZ decompressor library.

So the fact the stall is being hit here is a symptom and not
a cause. The decompressor code is likely running slowly due to
thrashing and waiting on paged-out buffers. This is not indicative
of any bug, merely a system running slowly due to overload.

As I said, this is not a Squashfs issue, because the code where the
stall takes place isn't in Squashfs.

The people responsible for the RCU code should have a lot more insight
into what happens when the system is thrashing, and how this will
throw up false positives. I believe this is an instance of
perfectly correct code running slowly due to thrashing and incorrectly
being flagged as looping.

CC'ing Paul E. McKenney <paulmck@xxxxxxxxxx>

Phillip

How big can these readahead sizes be? Should one of the loops include
cond_resched() calls?

Please allow me to report that a 6.1.0+ kernel (built this morning at
6 AM Berlin time from Torvalds' tree) with CONFIG_KMEMLEAK=y,
CONFIG_KASAN=y, CONFIG_LRU_GEN=y (multi-gen LRU) and
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=20 no longer exhibits the previously
seen RCU stalls, even with a timeout as low as 20 ms.
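
For reference, the options named above would appear in the kernel
.config roughly as follows. This is only a sketch of the relevant
lines, not the full configuration used for the build, and the
comments are annotations added here for illustration.

```
# Debugging options enabled for this test build
CONFIG_KMEMLEAK=y
CONFIG_KASAN=y

# Multi-gen LRU
CONFIG_LRU_GEN=y

# Expedited RCU CPU stall timeout, in milliseconds
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=20
```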

So I guess kudos go to the MG-LRU developers, or perhaps Mr. Lougher has made something more efficient in the meantime.

My $0.02!

Thank you,
Mirsad

--
Mirsad Goran Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu
--
System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia