Re: CAKE and r8169 cause panic on upload in v4.19

From: Heiner Kallweit
Date: Fri Oct 26 2018 - 16:21:54 EST


On 26.10.2018 21:26, Oleksandr Natalenko wrote:
> Hello.
>
> I was excited regarding the fact that v4.19 introduced CAKE, so I've deployed it on my home router.
>
> I used this script of mine [1]:
>
> # bufferbloat enp3s0.100 20 20
>
> to do its job on the VLAN interface, where the 20/20 ISP link is switched from the home switch. Basically, it just follows [2] with simple bandwidth restriction and egress mirroring using ifb.
>
> Then I thought it would be nice to run speedtest-cli on one of the computers in the home LAN, connected to this router. The download stage went fine, but immediately after the upload started I got a panic on the router: [3] (sorry, it is a photo, netconsole didn't work because, I assume, the panic happened in the networking code). I rebooted the router and tried once more, and got the same result, again during the upload stage. Then I rebooted again, replaced the CAKE script with my former HTB script, and after running speedtest-cli a couple of times there was no panic.
>
> Before running speedtest-cli I was using CAKE for a couple of days without generating much traffic just fine. It seems it crashes only if lots of traffic is generated with tools like this.
>
> My sysctl: [4] and ethtool -k: [5]
>
> So far, I've found something similar only here: [6] [7]. The common thing is r8169 driver in use, so, maybe, it is a driver issue, and CAKE is just happy to reveal it.
>
> If it is something known, please point me to a possible fix. If it is something new, I'm happy to provide more info on request, try patches etc. (as usual).
>
It seems to be the same problem as described here: https://bugzilla.kernel.org/show_bug.cgi?id=201063
As I commented in bugzilla, the GPF in dev_hard_start_xmit and the values of R12/R15 make me think
that a poisoned list pointer is being accessed. This is so deep in the network stack that I cannot
really imagine the network driver is to blame. One screenshot attached to the bug report shows that
the GPF also happened with the igb driver. Most likely we will only find out once somebody spends
the effort to bisect the issue.
d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") and some subsequent changes deal with
skb list processing, so maybe the issue is related to one of these changes.
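
To illustrate what "poisoned list pointer" means here, below is a minimal userspace sketch, not kernel
code. It assumes the x86_64 defaults that I believe apply to v4.19 (POISON_POINTER_DELTA of
0xdead000000000000, LIST_POISON1/2 at offsets 0x100/0x200) and 4-level paging with 48-bit virtual
addresses. The helper names is_canonical() and near_list_poison() are made up for this example.
list_del() writes these poison values into the removed entry, and dereferencing such a non-canonical
address raises a GPF rather than a page fault.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: x86_64 defaults as in include/linux/poison.h around v4.19. */
#define POISON_POINTER_DELTA 0xdead000000000000ULL
#define LIST_POISON1 (POISON_POINTER_DELTA + 0x100) /* written to entry->next by list_del() */
#define LIST_POISON2 (POISON_POINTER_DELTA + 0x200) /* written to entry->prev by list_del() */

/*
 * With 48-bit virtual addresses, bits 63..47 must all be equal.
 * Accessing a non-canonical address triggers a general protection fault.
 */
static bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}

/*
 * Hypothetical helper: does a register value look like a list poison
 * pointer, possibly plus a small structure field offset?
 */
static bool near_list_poison(uint64_t val)
{
	return (val - LIST_POISON1) < 0x100 || (val - LIST_POISON2) < 0x100;
}
```

If a register value from the oops makes near_list_poison() return true and is_canonical() return
false, that would be consistent with a use of an skb that was already removed from a list.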

> Thanks.
>