Re: [syzbot] [net?] INFO: rcu detected stall in unix_release

From: Vladimir Oltean
Date: Wed Aug 16 2023 - 18:59:13 EST


Hi Jakub,

On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> > Hi Vladimir, any ideas for this one?
> > The bisection looks pooped, FWIW, looks like a taprio inf loop.
>
> I'm looking into it.

Here's what I've found out and what help I'll need going forward.

Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(),
leading to an RCU stall.

Short description of taprio_dequeue_tc_priority(): it cycles
q->cur_txq[tc] in the range between [ offset, offset + count ), where:

int offset = dev->tc_to_txq[tc].offset;
int count = dev->tc_to_txq[tc].count;

with the initial q->cur_txq[tc], aka the "first_txq" variable, being set
by the control path: taprio_change(), also called by taprio_init():

if (mqprio) {
(...)
for (i = 0; i < mqprio->num_tc; i++) {
(...)
q->cur_txq[i] = mqprio->offset[i];
}
}

In the buggy case that leads to the RCU stall, the line in taprio_change()
which sets q->cur_txq[i] never gets executed. So first_txq will be 0
(pre-initialized memory), and if that's outside of the [ offset, offset + count )
range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects
to cycle through, the kernel is toast.

The nitty gritty of that is boring. What's not boring is how come the
control path skips the q->cur_txq[i] assignment. It's because "mqprio"
is NULL, and that's because taprio_change() (important: also tail-called
from taprio_init()) has this logic to detect a change in the traffic
class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP
netlink attribute:

/* no changes - no new mqprio settings */
if (!taprio_mqprio_cmp(q, dev, mqprio))
mqprio = NULL;

And what happens is that:
- we go through taprio_init()
- a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us
- taprio_mqprio_cmp() sees that there's no change compared to the
netdev's existing traffic class config
- taprio_change() sets "mqprio" to NULL, ignoring the given
TCA_TAPRIO_ATTR_PRIOMAP
- we skip modifying q->cur_txq[i], as if it was a taprio_change() call
that came straight from Qdisc_ops :: change(), rather than what it
really is: one from Qdisc_ops :: init()

So the next question: why does taprio_mqprio_cmp() see that there's no
change? Because there is no change. When Qdisc_ops :: init() is called,
the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and
all that.

But why? A previous taprio, if that existed, will call taprio_destroy()
-> netdev_reset_tc(), so it won't leave state behind that will hinder
the current taprio. Checking for stuff in the netdev state is just so
that taprio_change() can distinguish between a direct Qdisc_ops :: change()
call vs one coming from init().

Finally, here's where the syzbot repro becomes relevant. It crafts the
RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc()
in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written
in mind.

With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is
supposed to call init() the first time and replace() the second time.
What the repro does is make the above sequence call two init() methods
back to back.

To create an iproute2-based reproducer rather than the C one provided by
syzbot, we need this iproute2 change:

diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index 56086c43b7fa..20d9622b6bf3 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv)
return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1);
if (matches(*argv, "change") == 0)
return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1);
+ if (strcmp(*argv, "replace-exclusive") == 0)
+ return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1);
if (matches(*argv, "replace") == 0)
return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1);
if (matches(*argv, "link") == 0)

which basically implements a crafted alternative of "tc qdisc replace"
which also sets the NLM_F_EXCL flag in n->nlmsg_flags.

Then, the minimal repro script can simply be expressed as:

#!/bin/bash

ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1
ip link set veth0 up && ip link set veth1 up

for ((i = 0; i < 2; i++)); do
tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \
num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
queues 8@0 4@8 \
clockid REALTIME \
base-time 0 \
cycle-time 61679 \
sched-entry S 0 54336 \
sched-entry S 0x8a27 7343 \
max-sdu 18343 18343 \
flags 0
done

ip link del veth0

Here's how things go sideways if sch_api.c goes through the Qdisc_ops :: init()
code path instead of change() for the second Qdisc.

The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka itself)
to all netdev TX queues, and qdisc_put() the existing pfifo default Qdiscs.

When the taprio_init() method executes for i=1, taprio_destroy() hasn't
been called yet. So neither has netdev_reset_tc() been called, and
that's part of the problem (the one that causes the infinite loop in
dequeue()).

But, taprio_destroy() will finally get called for the initial taprio
created at i=0. The call trace looks like this:

rtnetlink_rcv_msg()
-> tc_modify_qdisc()
-> qdisc_graft()
-> taprio_attach() for i=1
-> qdisc_put() for the old Qdiscs attached to the TX queues, aka the taprio from i=0
-> __qdisc_destroy()
-> taprio_destroy()

What's more interesting is that the late taprio_destroy() for i=0
effectively destroys the netdev state - the netdev_reset_tc() call -
done by taprio_init() -> taprio_change() for i=1, and that can't be
too good, either. Even if there's no immediately observable hang, the
traffic classes are reset even though the Qdisc thinks they aren't.

Taprio isn't the only one affected by this. Mqprio also has the pattern
of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy().
But with the possibility of destroy(i=0) not being serialized with
init(i=1), that's buggy.

Sorry for the long message. This is where I'm at. For me, this is the
bottom of where things are intuitive. I don't understand what is
considered to be expected behavior from tc_modify_qdisc(), and what is
considered to be sane Qdisc-facing API, and I need help.

I've completely stopped debugging when I saw that the code enters
through this path at i=1, so I really can't tell you more:

/* This magic test requires explanation.
*
* We know, that some child q is already
* attached to this parent and have choice:
* either to change it or to create/graft new one.
*
* 1. We are allowed to create/graft only
* if CREATE and REPLACE flags are set.
*
* 2. If EXCL is set, requestor wanted to say,
* that qdisc tcm_handle is not expected
* to exist, so that we choose create/graft too.
*
* 3. The last case is when no flags are set.
* Alas, it is sort of hole in API, we
* cannot decide what to do unambiguously.
* For now we select create/graft, if
* user gave KIND, which does not match existing.
*/
if ((n->nlmsg_flags & NLM_F_CREATE) &&
(n->nlmsg_flags & NLM_F_REPLACE) &&
((n->nlmsg_flags & NLM_F_EXCL) ||
(tca[TCA_KIND] &&
nla_strcmp(tca[TCA_KIND], q->ops->id)))) {
netdev_err(dev, "magic test\n");
goto create_n_graft;
}

I've added more Qdisc people to the discussion. The problem description
is pretty much self-contained in this email, and going to the original
syzbot report won't bring much else.

There are multiple workarounds that can be done in taprio (and mqprio)
depending on what is considered as being sane API. Though I don't want
to get ahead of myself. Maybe there is a way to fast-forward the
qdisc_destroy() of the previous taprio so it doesn't overlap with the
new one's qdisc_create().