Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

From: Yunsheng Lin
Date: Tue Apr 06 2021 - 08:18:00 EST


On 2021/4/6 18:13, Juergen Gross wrote:
> On 06.04.21 09:06, Michal Kubecek wrote:
>> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>>
>>> Hi, Jiri
>>> Do you have a reproducer that can be shared here?
>>> With reproducer, I can debug and test it myself too.
>>
>> I'm afraid we are not aware of a simple reproducer. As mentioned in the
>> original discussion, the race window is extremely small and the other
>> thread has to do quite a lot in the meantime which is probably why, as
>> far as I know, this was never observed on real hardware, only in
>> virtualization environments. NFS may also be important as, IIUC, it can
>> often issue an RPC request from a different CPU right after a data
>> transfer. Perhaps you could cheat a bit and insert a random delay
>> between the empty queue check and releasing q->seqlock to make it more
>> likely to happen.
>>
>> Other than that, it's rather just "run this complex software in a xen VM
>> and wait".
>
> Being the one who has managed to reproduce the issue I can share my
> setup, maybe you can setup something similar (we have seen the issue
> with this kind of setup on two different machines).
>
> I'm using a physical machine with 72 cpus and 48 GB of memory. It is
> running Xen as virtualization platform.
>
> Xen dom0 is limited to 40 vcpus and 32 GB of memory, the dom0 vcpus are
> limited to run on the first 40 physical cpus (no idea whether that
> matters, though).
>
> In a guest with 16 vcpu and 8GB of memory I'm running 8 parallel
> sysbench instances in a loop, those instances are prepared via
>
> sysbench --file-test-mode=rndrd --test=fileio prepare
>
> and then started in a do while loop via:
>
> sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 --max-time=300 --max-requests=0 run
>
> Each instance is using a dedicated NFS mount to run on. The NFS
> server for the 8 mounts is running in dom0 of the same server, the
> data of the NFS shares is located in a RAM disk (size is a little bit
> above 16GB). The shares are mounted in the guest with:
>
> mount -t nfs -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport dom0:/ramdisk/share[1-8] /mnt[1-8]
>
> The guests vcpus are limited to run on physical cpus 40-55, on the same
> physical cpus I have 16 small guests running eating up cpu time, each of
> those guests is pinned to one of the physical cpus 40-55.
>
> That's basically it. All you need to do is to watch out for sysbench
> reporting maximum latencies above one second or so (in my setup there
> are latencies of several minutes at least once each hour of testing).
>
> In case you'd like to have some more details about the setup don't
> hesitate to contact me directly. I can provide you with some scripts
> and config runes if you want.
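
For reference, the 8 parallel instances Juergen describes could be scripted
roughly as below, with one dedicated NFS mount per instance (this is only a
sketch of such a loop, not Juergen's actual scripts):

# prepare the sysbench fileio data on each of the 8 NFS mounts
for i in $(seq 1 8); do
	(cd /mnt$i && sysbench --file-test-mode=rndrd --test=fileio prepare) &
done
wait

# then keep 8 instances running in an endless loop, one per mount
for i in $(seq 1 8); do
	(cd /mnt$i && while true; do
		sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 \
			--max-time=300 --max-requests=0 run
	done) &
done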

The setup is rather complex, so I just tried Michal's suggestion using
the below patch:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -920,7 +920,7 @@ struct sk_buff {
 				*data;
 	unsigned int		truesize;
 	refcount_t		users;
-
+	unsigned long		enqueue_jiffies;
 #ifdef CONFIG_SKB_EXTENSIONS
 	/* only useable after checking ->active_extensions != 0 */
 	struct skb_ext		*extensions;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 639e465..ba39b86 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -176,8 +176,14 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
+		udelay(10000);
+		udelay(10000);
+		udelay(10000);
+		udelay(10000);
+		udelay(10000);
 		spin_unlock(&qdisc->seqlock);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 49eae93..fcfae43 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
 			return qdisc_drop_cpu(skb, qdisc, to_free);
 		else
 			return qdisc_drop(skb, qdisc, to_free);
+	} else {
+		skb->enqueue_jiffies = jiffies;
 	}
 
 	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -651,6 +653,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 		skb = __skb_array_consume(q);
 	}
 	if (likely(skb)) {
+		unsigned int delay_ms;
+
+		delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
+
+		if (delay_ms > 100)
+			netdev_err(qdisc_dev(qdisc), "delay: %u ms\n", delay_ms);
+
 		qdisc_update_stats_at_dequeue(qdisc, skb);
 	} else {
 		WRITE_ONCE(qdisc->empty, true);


The patch stamps each skb with jiffies at enqueue time and lets the dequeue
side print a warning when a packet has been sitting in the qdisc for more
than 100 ms; the udelay() calls in qdisc_run_end() just widen the race
window, as Michal suggested.

Using the below shell loop:

while((1))
do
taskset -c 0 mz dummy1 -d 10000 -c 100&
taskset -c 1 mz dummy1 -d 10000 -c 100&
taskset -c 2 mz dummy1 -d 10000 -c 100&
sleep 3
done
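
For completeness, mz here is the mausezahn packet generator and dummy1 is
just a dummy netdevice with the lockless pfifo_fast attached as its root
qdisc, set up with something like:

# illustrative setup for the dummy1 interface used above
ip link add dummy1 type dummy
ip link set dummy1 up
# attach pfifo_fast (TCQ_F_NOLOCK) explicitly as the root qdisc
tc qdisc replace dev dummy1 root pfifo_fast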

And got the below log:
[ 80.881716] hns3 0000:bd:00.0 eth2: delay: 176 ms
[ 82.036564] hns3 0000:bd:00.0 eth2: delay: 296 ms
[ 87.065820] hns3 0000:bd:00.0 eth2: delay: 320 ms
[ 94.134174] dummy1: delay: 1588 ms
[ 94.137570] dummy1: delay: 1580 ms
[ 94.140963] dummy1: delay: 1572 ms
[ 94.144354] dummy1: delay: 1568 ms
[ 94.147745] dummy1: delay: 1560 ms
[ 99.065800] hns3 0000:bd:00.0 eth2: delay: 264 ms
[ 100.106174] dummy1: delay: 1448 ms
[ 102.169799] hns3 0000:bd:00.0 eth2: delay: 436 ms
[ 103.166221] dummy1: delay: 1604 ms
[ 103.169617] dummy1: delay: 1604 ms
[ 104.985806] dummy1: delay: 316 ms
[ 105.113797] hns3 0000:bd:00.0 eth2: delay: 308 ms
[ 107.289805] hns3 0000:bd:00.0 eth2: delay: 556 ms
[ 108.912922] hns3 0000:bd:00.0 eth2: delay: 188 ms
[ 137.241801] dummy1: delay: 30624 ms
[ 137.245283] dummy1: delay: 30620 ms
[ 137.248760] dummy1: delay: 30616 ms
[ 137.252237] dummy1: delay: 30616 ms


It seems the problem can be easily reproduced using Michal's
suggestion.

Will test and debug it using the above reproducer first.

>
>
> Juergen