Re: [syzbot] KASAN: use-after-free Read in xfs_qm_dqfree_one

From: Paul E. McKenney
Date: Tue Dec 06 2022 - 10:32:28 EST


On Tue, Dec 06, 2022 at 12:06:10PM +0100, Dmitry Vyukov wrote:
> On Tue, 6 Dec 2022 at 04:34, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Dec 05, 2022 at 07:12:15PM -0800, syzbot wrote:
> > > Hello,
> > >
> > > syzbot has tested the proposed patch but the reproducer is still triggering an issue:
> > > INFO: rcu detected stall in corrupted
> > >
> > > rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P4122 } 2641 jiffies s: 2877 root: 0x0/T
> > > rcu: blocking rcu_node structures (internal RCU debug):
> >
> > I'm pretty sure this has nothing to do with the reproducer - the
> > console log here:
> >
> > > Tested on:
> > >
> > > commit: bce93322 proc: proc_skip_spaces() shouldn't think it i..
> > > git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=1566216b880000
> >
> > indicates that syzbot is screwing around with bluetooth, HCI,
> > netdevsim, bridging, bonding, etc.
> >
> > There's no evidence that it actually ran the reproducer for the bug
> > reported in this thread - there's no record of a single XFS
> > filesystem being mounted in the log....
> >
> > It looks like someone else also tried a private patch to fix this
> > problem (which was obviously broken) and it failed with exactly the
> > same RCU warnings. That was run from the same commit id as the
> > original reproducer, so this looks like either syzbot is broken or
> > there's some other completely unrelated problem that syzbot is
> > tripping over here.
> >
> > Over to the syzbot people to debug the syzbot failure....
>
> Hi Dave,
>
> It's not uncommon for a single program to trigger multiple bugs.
> That's what happens here. The rcu stall issue is reproducible with
> this test program.
> In such cases you can either submit more test requests, or test manually.
>
> I think there is an RCU expedited stall detection.
> For some reason CONFIG_RCU_EXP_CPU_STALL_TIMEOUT is limited to 21
> seconds, and that's not enough for reliable flake-free stress testing.
> We bump other timeouts to 100+ seconds.
> +RCU maintainers, do you mind removing the overly restrictive limit on
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT?
> Or do you think there is something to fix in the kernel to not stall?
> I see the test writes to /proc/sys/vm/drop_caches, maybe there is some
> issue in that code.

Like this?

If so, I don't see why not. And in that case, may I please have
your Tested-by or similar?

At the same time, I am sure that there are things in the kernel that
should be adjusted to avoid stalls, but I recognize that different
developers in different situations will have different issues that they
choose to focus on. ;-)

Thanx, Paul
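
For illustration only, and not part of the patch below: a minimal sketch,
assuming the Kconfig "range" line is the only limit of interest, of which
timeout values the old and new ranges accept. The helper name in_range()
and the 100000 ms example value (taken from the "100+ seconds" mentioned
above) are hypothetical and chosen just for this sketch.

```python
# Old and new bounds for RCU_EXP_CPU_STALL_TIMEOUT, in milliseconds,
# copied from the "range" lines in the diff below.
OLD_RANGE_MS = (0, 21000)
NEW_RANGE_MS = (0, 300000)


def in_range(value_ms, bounds):
    """Return True if value_ms lies within the inclusive bounds."""
    low, high = bounds
    return low <= value_ms <= high


# Hypothetical request: a 100-second timeout for stress testing.
requested_ms = 100 * 1000

print("old range accepts it:", in_range(requested_ms, OLD_RANGE_MS))
print("new range accepts it:", in_range(requested_ms, NEW_RANGE_MS))
```

With the old upper bound a 100-second value cannot be configured, while
the proposed bound leaves room for it.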

------------------------------------------------------------------------

diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug
index 49da904df6aa6..2984de629f749 100644
--- a/kernel/rcu/Kconfig.debug
+++ b/kernel/rcu/Kconfig.debug
@@ -82,7 +82,7 @@ config RCU_CPU_STALL_TIMEOUT
 config RCU_EXP_CPU_STALL_TIMEOUT
 	int "Expedited RCU CPU stall timeout in milliseconds"
 	depends on RCU_STALL_COMMON
-	range 0 21000
+	range 0 300000
 	default 0
 	help
 	  If a given expedited RCU grace period extends more than the