Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting

From: Peilin Ye
Date: Wed May 10 2023 - 16:11:32 EST


On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.

Thanks, happy to help!

> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.

Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard:

(3rd) is called from ->destroy(), so currently it uses RCU_INIT_POINTER()
to set dev->miniq_ingress to NULL. It would need logic like:

  I am A. Set dev->miniq_ingress to NULL, if and only if it is a1 or a2,
  and do it atomically.
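
As a rough illustration of that "set NULL iff a1 or a2" requirement, here
is a minimal user-space sketch using C11 atomics rather than the kernel's
cmpxchg(). The names "struct mini", "clear_if_ours()" and
"clear_if_ours_demo()" are hypothetical and exist only for this example.
It shows that a load plus a retry loop is needed around the
compare-and-swap. It says nothing about the ordering of the swaps, which
is a separate problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mini_Qdisc; only pointer identity matters. */
struct mini { int id; };

/*
 * Clear *slot only if it currently points at a1 or a2.
 * Returns true if this call cleared it, false if it was left alone.
 */
static bool clear_if_ours(struct mini *_Atomic *slot,
                          struct mini *a1, struct mini *a2)
{
    struct mini *cur = atomic_load(slot);

    while (cur == a1 || cur == a2) {
        /* On failure, cur is updated to the value now in *slot; re-check it. */
        if (atomic_compare_exchange_weak(slot, &cur, (struct mini *)NULL))
            return true;
    }
    return false;
}

/* Small self-check of both outcomes (assumption: single-threaded demo). */
static void clear_if_ours_demo(void)
{
    struct mini a1 = { 1 }, a2 = { 2 }, b1 = { 3 };
    struct mini *_Atomic slot;

    atomic_init(&slot, &a2);
    assert(clear_if_ours(&slot, &a1, &a2));   /* a2 is ours: cleared */
    assert(atomic_load(&slot) == NULL);

    atomic_store(&slot, &b1);
    assert(!clear_if_ours(&slot, &a1, &a2));  /* b1 is not ours: untouched */
    assert(atomic_load(&slot) == &b1);
}
```

Calling clear_if_ours_demo() from a main() exercises both the "ours" and
"not ours" cases.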

We need more than a cmpxchg() to implement this "set NULL iff a1 or a2".
Additionally:

On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1                            A's refcnt   Thread 2
>  RTM_NEWQDISC (A, RTNL-locked)
>   qdisc_create(A)                        1
>   qdisc_graft(A)                         9
>
>  RTM_NEWTFILTER (X, RTNL-lockless)
>   __tcf_qdisc_find(A)                   10
>   tcf_chain0_head_change(A)
>   mini_qdisc_pair_swap(A) (1st)
>            |
>            |                                      RTM_NEWQDISC (B, RTNL-locked)
>           RCU                            2         qdisc_graft(B)
>            |                             1         notify_and_destroy(A)
>            |
>   tcf_block_release(A)                   0        RTM_NEWTFILTER (Y, RTNL-lockless)
>   qdisc_destroy(A)                                 tcf_chain0_head_change(B)
>   tcf_chain0_head_change_cb_del(A)                 mini_qdisc_pair_swap(B) (2nd)
>   mini_qdisc_pair_swap(A) (3rd)                              |
>       ...                                                   ...

Looking at the code, I do not see any guarantee that (1st) happens before
(2nd), although the reverse order seems unlikely. Can RTNL-lockless
RTM_NEWTFILTER handlers get preempted?

If (1st) happens later than (2nd), we will need to make (1st) a no-op, by
detecting that we are the "old" Qdisc. I am not sure there is any
(clean) way to do it. I even thought about:

  (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
  (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
      means we are the "old" Qdisc (have been replaced), and should do
      nothing.

However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...
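
To make the container_of() problem concrete, here is a small user-space
sketch. The structs "mini_pair" and "clsact_priv" are simplified,
hypothetical stand-ins (not the kernel definitions), and container_of() is
re-implemented locally with offsetof(). The point is only that recovering
the container needs the member name, and guessing the wrong member yields
a wrong pointer with no way to notice from the pair pointer alone.

```c
#include <assert.h>
#include <stddef.h>

/* Local, simplified container_of(); the kernel version adds type checks. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct mini_Qdisc_pair. */
struct mini_pair { int dummy; };

/* Hypothetical stand-in for clsact private data: two pairs in one struct. */
struct clsact_priv {
    struct mini_pair miniqp_ingress;
    struct mini_pair miniqp_egress;
};

/* Correct only if p really is the ingress pair. */
static struct clsact_priv *priv_from_ingress(struct mini_pair *p)
{
    return container_of(p, struct clsact_priv, miniqp_ingress);
}

/* Correct only if p really is the egress pair. */
static struct clsact_priv *priv_from_egress(struct mini_pair *p)
{
    return container_of(p, struct clsact_priv, miniqp_egress);
}

/* Small self-check: right guesses recover the container, a wrong one does not. */
static void container_of_demo(void)
{
    struct clsact_priv q = { { 0 }, { 0 } };

    assert(priv_from_ingress(&q.miniqp_ingress) == &q);
    assert(priv_from_egress(&q.miniqp_egress) == &q);

    /* Treating the egress pair as ingress gives a pointer that is not &q. */
    assert(priv_from_ingress(&q.miniqp_egress) != &q);
}
```

A caller that only holds a "struct mini_pair *" has no field telling it
which of the two helpers applies, which is the ambiguity described above.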

Eventually I created [5,6/6]. It is indeed a workaround, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue. However, I think it
makes the code correct in a relatively understandable way, without slowing
down mini_qdisc_pair_swap() or sch_handle_*gress().

Thanks,
Peilin Ye