Re: INFO: rcu detected stall in wg_packet_tx_worker

From: Jason A. Donenfeld
Date: Sun Apr 26 2020 - 16:46:27 EST


On Sun, Apr 26, 2020 at 2:38 PM Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>
>
>
> On 4/26/20 1:26 PM, Eric Dumazet wrote:
> >
> >
> > On 4/26/20 12:42 PM, Jason A. Donenfeld wrote:
> >> On Sun, Apr 26, 2020 at 1:40 PM Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> On 4/26/20 10:57 AM, syzbot wrote:
> >>>> syzbot has bisected this bug to:
> >>>>
> >>>> commit e7096c131e5161fa3b8e52a650d7719d2857adfd
> >>>> Author: Jason A. Donenfeld <Jason@xxxxxxxxx>
> >>>> Date: Sun Dec 8 23:27:34 2019 +0000
> >>>>
> >>>> net: WireGuard secure network tunnel
> >>>>
> >>>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=15258fcfe00000
> >>>> start commit: b2768df2 Merge branch 'for-linus' of git://git.kernel.org/..
> >>>> git tree: upstream
> >>>> final crash: https://syzkaller.appspot.com/x/report.txt?x=17258fcfe00000
> >>>> console output: https://syzkaller.appspot.com/x/log.txt?x=13258fcfe00000
> >>>> kernel config: https://syzkaller.appspot.com/x/.config?x=b7a70e992f2f9b68
> >>>> dashboard link: https://syzkaller.appspot.com/bug?extid=0251e883fe39e7a0cb0a
> >>>> userspace arch: i386
> >>>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15f5f47fe00000
> >>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11e8efb4100000
> >>>>
> >>>> Reported-by: syzbot+0251e883fe39e7a0cb0a@xxxxxxxxxxxxxxxxxxxxxxxxx
> >>>> Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
> >>>>
> >>>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> >>>>
> >>>
> >>> I have not looked at the repro closely, but WireGuard has some workers
> >>> that might loop forever, cond_resched() might help a bit.
> >>
> >> I'm working on this right now. I'm having a bit of a difficult time
> >> getting it to reproduce locally...
> >>
> >> The reports show the stall happening always at:
> >>
> >> static struct sk_buff *
> >> sfq_dequeue(struct Qdisc *sch)
> >> {
> >>         struct sfq_sched_data *q = qdisc_priv(sch);
> >>         struct sk_buff *skb;
> >>         sfq_index a, next_a;
> >>         struct sfq_slot *slot;
> >>
> >>         /* No active slots */
> >>         if (q->tail == NULL)
> >>                 return NULL;
> >>
> >> next_slot:
> >>         a = q->tail->next;
> >>         slot = &q->slots[a];
> >>
> >> Which is kind of interesting, because it's not like that should block
> >> or anything, unless there's some KASAN faulting happening.
> >>
> >
> > I am not really sure WireGuard is involved, the repro does not rely on it anyway.
> >
>
> Yes, do not spend too much time on this.
>
> syzbot found its way into crazy qdisc settings these last days.
>
> (I sent a patch yesterday for the choke qdisc; it seems similar checks are needed in sfq.)

Ah, whew, okay. I had just begun instrumenting sfq (the highly
technical term for "adding printks everywhere") to figure out what's
going on. Looks like you've got a handle on it, so I'll let you have
at it.

On the brighter side, it seems like Dmitry's and my effort to get full
coverage of WireGuard has paid off in the sense that tons of packets
wind up being shoveled through it in one way or another, which is
good.