[PATCH v3 0/2] *** drivers: net: sun4i-emac: Fix emac_timeout ***

From: qianfanguijin
Date: Thu Apr 27 2023 - 06:53:36 EST


From: qianfan Zhao <qianfanguijin@xxxxxxx>

History:

2022-09-12:
Introduced the first patch, which can be read at:
https://lkml.kernel.org/lkml/20220912063331.23369-1-qianfanguijin@xxxxxxx/
It was reviewed by Jernej Skrabec <jernej.skrabec@xxxxxxxxx> but has not been
merged.

2023-04-27:

After applying the first patch I found that the bug was not fully fixed.
I still sometimes get these error messages:

[ 108.581230] spi_master spi2: spi2.1: timeout transferring 1025 bytes@100000Hz for 190(164)ms
[ 108.590337] spidev spi2.1: SPI transfer failed: -110
[ 108.595443] spi_master spi2: failed to transfer one message from queue
...

I tried the `kdump` and `crash` tools but found nothing useful.

A few days later I found that `softirq` was taking about 100% of one CPU
core. Tracing the softirq_entry, softirq_exit and net_dev_xmit events gave
this flood of messages:

289.902631: softirq_entry: vec=2 [action=NET_TX]
289.902651: net_dev_xmit: dev=eth0 skbaddr=(ptrval) len=98 rc=16
289.902656: softirq_exit: vec=2 [action=NET_TX]
289.902659: softirq_entry: vec=2 [action=NET_TX]
289.902664: net_dev_xmit: dev=eth0 skbaddr=(ptrval) len=98 rc=16
289.902668: softirq_exit: vec=2 [action=NET_TX]
...

I then debugged the Linux kernel under QEMU, making the emac device model in
QEMU drop some TX packets with this change:

diff --git a/hw/net/allwinner_emac.c b/hw/net/allwinner_emac.c
index 372e5b66da..28dfb1116b 100644
--- a/hw/net/allwinner_emac.c
+++ b/hw/net/allwinner_emac.c
@@ -349,9 +349,14 @@ static void aw_emac_write(void *opaque, hwaddr offset, uint64_t value,
"allwinner_emac: TX length > fifo data length\n");
}
if (len > 0) {
+ int ignore = random() % 10 < 1;
data = fifo8_pop_buf(fifo, len, &ret);
- qemu_send_packet(nc, data, ret);
+ if (!ignore)
+ qemu_send_packet(nc, data, ret);
aw_emac_tx_reset(s, chan);
+
+ if (ignore)
+ break;
/* Raise TX interrupt */
s->int_sta |= EMAC_INT_TX_CHAN(chan);
aw_emac_update_irq(s);

With this change the bug is very easy to reproduce.

Here is the gdb backtrace taken when the softirq was raised again:

#0 __raise_softirq_irqoff (nr=nr@entry=2) at kernel/softirq.c:699
#1 raise_softirq_irqoff (nr=nr@entry=2) at kernel/softirq.c:671
#2 0xc0855a34 in __netif_reschedule (q=0xc2027c00) at net/core/dev.c:3041
#3 __netif_schedule (q=q@entry=0xc2027c00) at net/core/dev.c:3048
#4 0xc085b0ec in qdisc_run_end (qdisc=0xc2027c00) at ./include/net/sch_generic.h:227
#5 qdisc_run (q=0xc2027c00) at ./include/net/pkt_sched.h:133
#6 net_tx_action (h=<optimized out>) at net/core/dev.c:5046
#7 0xc0101298 in __do_softirq () at kernel/softirq.c:558
#8 0xc0127cd0 in run_ksoftirqd (cpu=<optimized out>) at kernel/softirq.c:920
#9 0xc01487d0 in smpboot_thread_fn (data=0xc14a2780) at kernel/smpboot.c:164
#10 0xc0144b58 in kthread (_create=0xc14a2800) at kernel/kthread.c:319
#11 0xc0100130 in ret_from_fork () at arch/arm/kernel/entry-common.S:146
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

`net_tx_action` runs in `__do_softirq` and sends packets from `qdisc_run`.
But the emac driver in Linux always returns NETDEV_TX_BUSY(16) after
emac_timeout because we forget to reset `db->tx_fifo_stat`, which makes
`__netif_schedule` raise the softirq again and again.
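
The effect of the stale `tx_fifo_stat` can be illustrated with a small
user-space model. This is only a sketch, not the driver code: `struct
model_emac`, every `model_*` function and `busy_retries_after_timeout` are
hypothetical names, and the two-channel, one-bit-per-channel layout of
`tx_fifo_stat` is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_TX_OK   0   /* stands in for NETDEV_TX_OK */
#define MODEL_TX_BUSY 16  /* stands in for NETDEV_TX_BUSY (rc=16 in the trace) */

/* Hypothetical stand-in for the driver's private state. */
struct model_emac {
	unsigned int tx_fifo_stat;	/* assumed: bit n set = TX channel n in flight */
};

/* Claim a free channel; report busy when both are marked in flight. */
static int model_xmit(struct model_emac *db, int *chan)
{
	unsigned int stat = db->tx_fifo_stat & 3;

	if (stat == 3)
		return MODEL_TX_BUSY;

	*chan = (stat & 1) ? 1 : 0;
	db->tx_fifo_stat |= 1u << *chan;
	return MODEL_TX_OK;
}

/* TX-done interrupt: the channel becomes free again. */
static void model_tx_done(struct model_emac *db, int chan)
{
	assert(chan == 0 || chan == 1);
	db->tx_fifo_stat &= ~(1u << chan);
}

/* Timeout handler: bookkeeping is cleared only when reset_stat is true. */
static void model_timeout(struct model_emac *db, bool reset_stat)
{
	if (reset_stat)
		db->tx_fifo_stat = 0;
}

/*
 * Lose two TX-done interrupts, run the timeout handler, then retry
 * `attempts` times. Returns how many retries came back busy.
 */
static int busy_retries_after_timeout(bool reset_stat, int attempts)
{
	struct model_emac db = { 0 };
	int chan = 0, busy = 0;

	model_xmit(&db, &chan);	/* completion lost */
	model_xmit(&db, &chan);	/* completion lost */
	model_timeout(&db, reset_stat);

	for (int i = 0; i < attempts; i++) {
		if (model_xmit(&db, &chan) == MODEL_TX_BUSY)
			busy++;		/* the qdisc would requeue and reschedule NET_TX */
		else
			model_tx_done(&db, chan);	/* hardware completes normally */
	}
	return busy;
}
```

In this model `busy_retries_after_timeout(false, n)` reports every retry as
busy, which corresponds to the endless NET_TX softirq loop in the trace,
while `busy_retries_after_timeout(true, n)` reports none.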

qianfan Zhao (2):
drivers: net: sun4i-emac: Fix double spinlock in emac_timeout
drivers: net: sun4i-emac: Fix emac_timeout

drivers/net/ethernet/allwinner/sun4i-emac.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

--
2.25.1