Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

From: Yunsheng Lin
Date: Tue Apr 06 2021 - 08:24:23 EST


On 2021/4/6 15:31, Michal Kubecek wrote:
> On Tue, Apr 06, 2021 at 10:46:29AM +0800, Yunsheng Lin wrote:
>> On 2021/4/6 9:49, Cong Wang wrote:
>>> On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <jikos@xxxxxxxxxx> wrote:
>>>>
>>>> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
>>>> coming days. If it works, then we can consider proceeding with it,
>>>> otherwise I am all for reverting the whole NOLOCK stuff.
>>>>
>>>> [1] https://lore.kernel.org/linux-can/1616641991-14847-1-git-send-email-linyunsheng@xxxxxxxxxx/T/#u
>>>
>>> I personally prefer to just revert that bit, as it brings more troubles
>>> than gains. Even with Yunsheng's patch, there are still some issues.
>>> Essentially, I think the core qdisc scheduling code is not ready for
>>> lockless, just look at those NOLOCK checks in sch_generic.c. :-/
>>
>> I am aware of the NOLOCK checks too:), and I am willing to
>> take care of them if that is possible.
>>
>> As the number of cores in a system increases, going lockless is
>> the trend, right? Even when only one CPU is involved, taking and
>> releasing the spinlock costs about 30ns on our arm64 system with
>> CONFIG_PREEMPT_VOLUNTARY enabled (IP forwarding test).
>
> I agree with the benefits, but currently the situation is that we have
> a race condition affecting the default qdisc which is being hit in
> production and can cause serious trouble, made worse by commit
> 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host
> queues") preventing retransmits of the stuck packet from being sent.
>
> Perhaps rather than patching over the current implementation, which
> requires more and more complicated hacks to work around the fact that
> we cannot make the "queue is empty" check and the exit from the
> critical section atomic, it would make sense to reimplement it in a
> way that would allow us to make them atomic.

Yes, reimplementing it is also an option.
But what if the reimplementation ends up with the same problem because
we have not found the root cause of this one? I think it is better to
find the root cause first.
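
For anyone following along, here is a minimal userspace sketch of the
lost-wakeup window we are talking about. The names (RUNNING flag,
enqueue, run) only loosely mimic the NOLOCK pattern in
net/sched/sch_generic.c; this is illustrative code for the race, not
the actual kernel code, and it may need many runs to actually hit the
window:

        /* gcc -O2 -pthread race.c -o race */
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        static atomic_bool running;   /* loosely mimics __QDISC_STATE_RUNNING */
        static atomic_int queue_len;  /* loosely mimics the skb queue depth */

        /* loosely mimics qdisc_run_begin(): only one thread may run */
        static bool run_begin(void)
        {
                return !atomic_exchange(&running, true);
        }

        /* loosely mimics qdisc_run_end() */
        static void run_end(void)
        {
                atomic_store(&running, false);
        }

        static void run_qdisc(void)
        {
                /* drain until we observe the queue empty */
                while (atomic_load(&queue_len) > 0)
                        atomic_fetch_sub(&queue_len, 1); /* "transmit" one */
                /*
                 * Window of interest: an enqueue landing here still sees
                 * "running" set, skips running the qdisc itself, and its
                 * packet is left behind once we clear the flag below.
                 */
        }

        static void *xmit(void *arg)
        {
                atomic_fetch_add(&queue_len, 1); /* enqueue one packet */
                if (run_begin()) {
                        run_qdisc();
                        run_end();
                }
                /* else: rely on the current runner to flush our packet */
                return NULL;
        }

        int main(void)
        {
                pthread_t t[2];
                int i;

                for (i = 0; i < 2; i++)
                        pthread_create(&t[i], NULL, xmit, NULL);
                for (i = 0; i < 2; i++)
                        pthread_join(t[i], NULL);

                /* nonzero here means a packet is stuck with no runner */
                printf("leftover packets: %d\n", atomic_load(&queue_len));
                return 0;
        }

Any reimplementation has to either make the empty check and the flag
clearing atomic, or re-check the queue after clearing the flag; [1]
roughly takes the second approach with an extra MISSED bit.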

>
> Michal