Re: [PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc

From: Yunsheng Lin
Date: Mon Apr 19 2021 - 21:45:29 EST


On 2021/4/19 22:57, Michal Kubecek wrote:
> On Mon, Apr 19, 2021 at 10:04:27AM +0800, Yunsheng Lin wrote:
>>>
>>> I tried this patch on top of 5.12-rc7 with real devices. I used two
>>> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
>>> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
>>> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
>>> a saturated ethernet, the CPU consumption grows quite a lot:
>>>
>>> threads    unpatched 5.12-rc7    5.12-rc7 + v3
>>>       1                 25.6%            30.6%
>>>       8                 73.1%           241.4%
>>>     128                132.2%          1012.0%
>>>
>>> (The values are normalized to one core, i.e. 100% corresponds to one
>>> fully used logical CPU.) I didn't perform a full statistical evaluation
>>> but the growth is way beyond any statistical fluctuation with one
>>> exception: the 8-thread test of the patched kernel showed values from 155.5%
>>> to 311.4%. A closer look shows that most of the CPU time was spent in softirq
>>> and running top in parallel with the test confirms that there are
>>> multiple ksoftirqd threads running at 100% CPU. I had a similar problem
>>> with earlier versions of my patch (work in progress, I still need to
>>> check some corner cases and write commit message explaining the logic)
>>
>> Great, if there is a better idea, maybe share the core idea first so
>> that we both can work on that?
>
> I'm not sure if it's really better but to minimize the false positives
> and unnecessary calls to __netif_schedule(), I replaced q->seqlock with
> an atomic combination of a "running" flag (which corresponds to current
> seqlock being locked) and a "drainers" count (number of other threads
> going to clean up the qdisc queue). This way we could keep track of them
> and get reliable information if another thread is going to run a cleanup
> after we leave the qdisc_run() critical section (so that there is no
> need to schedule).

It seems you are trying to match each skb enqueue with a call to
__qdisc_run() here, which is not reliable once dequeue batching is taken
into account, see try_bulk_dequeue_skb() and try_bulk_dequeue_skb_slow()
in dequeue_skb().
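
Just to check that I understand the scheme: something roughly like the below,
with the running bit and the drainers count packed into one atomic word?
(Only a sketch of my understanding, the names are made up and a drainer would
still need to subtract its unit when it actually takes over or gives up,
which is left out here; this is not meant to be your actual patch.)

/* bit 0 = "running" (corresponds to q->seqlock held), the remaining
 * bits count other threads that still want a cleanup to happen.
 */
#define QD_RUNNING	0x1
#define QD_DRAINER	0x2	/* one unit per pending drainer */

static bool qd_run_begin(atomic_t *st)
{
	int old = atomic_read(st);

	/* Try to become the single running thread. */
	while (!(old & QD_RUNNING))
		if (atomic_try_cmpxchg(st, &old, old | QD_RUNNING))
			return true;

	/* Somebody else is running: record that one more cleanup is
	 * wanted after it finishes.
	 */
	atomic_add(QD_DRAINER, st);
	return false;
}

static bool qd_run_end(atomic_t *st)
{
	/* Drop the running bit; a non-zero result means other threads
	 * registered as drainers, so the queue will be looked at again
	 * and there is no need to call __netif_schedule().
	 */
	return atomic_sub_return(QD_RUNNING, st) != 0;
}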

>
>>> The biggest problem IMHO is that the loop in __qdisc_run() may finish
>>> without rescheduling not only when the qdisc queue is empty but also
>>> when the corresponding device Tx queue is stopped which devices tend to
>>> do whenever they cannot send any more packets out. Thus whenever
>>> __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
>>> frozen, we keep rescheduling the queue cleanup without any chance to
>>> progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
>>> we need is another thread to fail the first spin_trylock() while device
>>> queue is stopped and qdisc queue not empty.
>>
>> Yes, we could just return false before doing the second spin_trylock() if
>> the netdev queue corresponding to the qdisc is stopped; when the netdev
>> queue is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().
>>
>> Maybe add a sch_qdisc_stopped() function and do the testing in qdisc_run_begin():
>>
>> 	if (dont_retry || sch_qdisc_stopped(qdisc))
>> 		return false;
>>
>> bool sch_qdisc_stopped(struct Qdisc *q)
>> {
>> 	const struct netdev_queue *txq = q->dev_queue;
>>
>> 	if (netif_xmit_frozen_or_stopped(txq))
>> 		return true;
>>
>> 	return false;
>> }
>>
>> At least for qdiscs with the TCQ_F_ONETXQUEUE flag set this should be doable?
>
> Either this or you can do the check in qdisc_run_end() - when the device
> queue is stopped or frozen, there is no need to schedule as we know it's
> going to be done when the flag is cleared again (and we cannot do
> anything until then anyway).
>
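
Doing it in qdisc_run_end() seems doable too, maybe something like the below
untested sketch (only for qdisc with TCQ_F_ONETXQUEUE set, where
qdisc->dev_queue maps to a single txq; __QDISC_STATE_NEED_RESCHEDULE is the
bit added by this patch):

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* No need to reschedule while the txq is stopped or
		 * frozen: netif_tx_wake_queue() will do the
		 * __netif_schedule() for us when the txq is restarted.
		 */
		if (test_bit(__QDISC_STATE_NEED_RESCHEDULE,
			     &qdisc->state) &&
		    !netif_xmit_frozen_or_stopped(qdisc->dev_queue))
			__netif_schedule(qdisc);
	} else {
		write_seqcount_end(&qdisc->running);
	}
}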
>>> Another part of the problem may be that to avoid the race, the logic is
>>> too pessimistic: consider e.g. (dotted lines show "barriers" where
>>> ordering is important):
>>>
>>> CPU A                               CPU B
>>> spin_trylock() succeeds
>>>                                     pfifo_fast_enqueue()
>>> ..................................................................
>>> skb_array empty, exit loop
>>>                                     first spin_trylock() fails
>>>                                     set __QDISC_STATE_NEED_RESCHEDULE
>>>                                     second spin_trylock() fails
>>> ..................................................................
>>> spin_unlock()
>>>                                     call __netif_schedule()
>>>
>>> When we switch the order of spin_lock() on CPU A and second
>>> spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
>>> before CPU A tests it), we end up scheduling a queue cleanup even if
>>> there is already one running. And either of these is quite realistic.
>>
>> But I am not sure whether it is a good thing or a bad thing when the above
>> happens, because it may help to enable tx batching?
>
> In either of the two scenarios, we call __netif_schedule() to schedule
> a cleanup which does not do anything useful. In the first, the qdisc queue
> is empty so that either the scheduled cleanup finds it empty or there
> will be some new packets which would have their own cleanup. In the second,
> scheduling a cleanup will not prevent the other thread from finishing the
> cleanup it has already started.

The main idea is that the thread holding q->seqlock does as much dequeuing
as possible while other threads are enqueuing skbs, which seems to make it
possible to avoid the above case?
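
For reference, the pattern I mean for the TCQ_F_NOLOCK case looks roughly
like the below (simplified from __dev_xmit_skb() and written from memory,
so not the exact mainline code):

static int nolock_xmit_sketch(struct sk_buff *skb, struct Qdisc *q)
{
	struct sk_buff *to_free = NULL;
	int rc;

	/* Every cpu enqueues its own skb first ... */
	rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;

	/* ... but only the cpu that wins q->seqlock dequeues, and it
	 * keeps dequeuing (possibly in bulk, see try_bulk_dequeue_skb())
	 * until the queue looks empty, so it normally also sends the
	 * skbs that the losing cpus enqueued in the meantime.
	 */
	if (qdisc_run_begin(q)) {
		__qdisc_run(q);
		qdisc_run_end(q);
	}

	if (unlikely(to_free))
		kfree_skb_list(to_free);

	return rc;
}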

>
>>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>>> index 44991ea..4953430 100644
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>>  	struct sk_buff *skb = NULL;
>>>> +	bool need_retry = true;
>>>>  	int band;
>>>>
>>>> +retry:
>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>  		struct skb_array *q = band2list(priv, band);
>>>>
>>>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  	}
>>>>  	if (likely(skb)) {
>>>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>>> +	} else if (need_retry &&
>>>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>>>> +				      &qdisc->state)) {
>>>> +		/* do another dequeuing after clearing the flag to
>>>> +		 * avoid calling __netif_schedule().
>>>> +		 */
>>>> +		smp_mb__after_atomic();
>>>> +		need_retry = false;
>>>> +
>>>> +		goto retry;
>>>>  	} else {
>>>>  		WRITE_ONCE(qdisc->empty, true);
>>>>  	}
>>>
>>> Does the retry really provide significant improvement? IIUC it only
>>> helps if all of enqueue, first spin_trylock() failure and setting the
>>> __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
>>> skb_array empty and checking the flag. If enqueue happens before we
>>> check the array (and flag is set before the check), the retry is
>>> useless. If the flag is set after we check it, we don't catch it (no
>>> matter if the enqueue happened before or after we found skb_array
>>> empty).
>>
>> In order to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"
>> as much as possible, __QDISC_STATE_NEED_RESCHEDULE needs to stay set for as
>> long as possible, so __QDISC_STATE_NEED_RESCHEDULE is only cleared when the
>> queue is empty.
>
> This is what I'm worried about. We are trying to address a race
> condition which is extremely rare so we should avoid adding too much
> overhead to the normal use.
>
>> And it gives about a 5% performance improvement.
>
> OK then. I thought that it would do an unnecessary dequeue retry much
> more often than prevent an unnecessary __netif_schedule() but I did not
> run any special benchmark to check.

When the netdev tx queue is not stopped:
1. if __QDISC_STATE_NEED_RESCHEDULE is set and there are a lot of skbs to be
dequeued, it is likely that __netif_schedule() has already been called in
__qdisc_run(), so the __netif_schedule() called in qdisc_run_end() is a
no-op (see the __netif_schedule() snippet below).

2. if __QDISC_STATE_NEED_RESCHEDULE is set during the data race this patch is
trying to fix, then we do need to call __netif_schedule().

3. otherwise __QDISC_STATE_NEED_RESCHEDULE is cleared by the
test_and_clear_bit() added in pfifo_fast_dequeue().

The main point here is to delay clearing the __QDISC_STATE_NEED_RESCHEDULE
bit as much as possible so that the set_bit() and the second spin_trylock()
are avoided.
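
For case 1: __netif_schedule() is already guarded by the __QDISC_STATE_SCHED
bit, so the extra call is cheap when a reschedule is pending anyway (quoting
net/core/dev.c from memory, comment added by me):

void __netif_schedule(struct Qdisc *q)
{
	/* Only the first caller actually reschedules the qdisc; once
	 * __QDISC_STATE_SCHED is set, further calls are no-ops until
	 * net_tx_action() has run and cleared the bit again.
	 */
	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
		__netif_reschedule(q);
}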

>
> Michal
>
>