Re: Kernel panic in netif_rx_internal after v6 pings between netns

From: Matthieu Baerts
Date: Mon Jan 22 2024 - 14:16:28 EST


On 22/01/2024 19:36, Jakub Kicinski wrote:
> On Mon, 22 Jan 2024 19:22:42 +0100 Matthieu Baerts wrote:
>>> Somewhat related. What do you do currently to ignore crashes?
>>
>> I was wondering why you wanted to ignore crashes :) ... but then I saw
>> the new "Test ignored" and "Crashes ignored" sections on the status
>> page. Just to be sure: you don't want to report issues that have not
>> been introduced by the new patches, right?
>
> Initially, yes, but going forward I bet we'll always see crashes and
> breakage introduced downstream. So we need some knobs to selectively
> silence failing things.

Even if I guess this will mainly happen around the merge window, I
understand it will be annoying to carry known issues for at least one
week, until the next sync with Linus' tree.

> In an ideal world we'd also have some form of "last seen" stat
> displayed to know when to retire these entries..

Good idea!
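
To make the idea concrete (just a sketch with invented names, not
NIPA code): keep a "last seen" timestamp per ignored entry, and flag
the ones that have not been observed for a while:

from datetime import datetime, timedelta

# Sketch only: "ignored" maps a crash fingerprint (or a test name) to
# the time it was last observed; entries not seen for a while become
# candidates for retirement. The 30-day threshold is arbitrary.
RETIRE_AFTER = timedelta(days=30)

def stale_entries(ignored, now=None):
    now = now or datetime.now()
    return [key for key, last_seen in ignored.items()
            if now - last_seen > RETIRE_AFTER]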

>> We don't need to do that on MPTCP side:
>> - either it is a new crash caused by patches that are in review, and
>> it does not impact others → we test each series individually, not a
>> batch of series.
>> - or there are issues with recent patches, not yet in netdev → we fix,
>> or revert.
>> - or there is an issue elsewhere, like the kernel panic we reported
>> here: usually I try to quickly apply a workaround, e.g. a fix or a
>> revert. I don't think we ever had an issue really impacting us where
>> we couldn't find a quick solution within one or two days. With the
>> panic we reported here, ~15% of the tests had an issue; that's "OK"
>> to live with for a few days/weeks.
>>
>> With fewer tests and a smaller community, it is easier for us to just
>> say on the ML and in the weekly meetings: "this is a known issue,
>> please ignore it for the moment". But when possible, I try to add a
>> workaround/fix in the repo used by our CI and devs (not upstreamed).
>>
>> For NIPA CI, do you want to do the same as for the builds and
>> compare with a reference? Or with multiple references, to take
>> unstable tests into account? Or maintain a list of known issues (I
>> think you started to do that; probably safer/easier for the moment)?
>
> Exactly - where we can, a before/after diff is best. We do that for
> all the static checker / build kind of tests. But for selftests I'm
> not sure how effective and applicable that is. Even the stack trace I
> posted here happens somewhat unreliably :( We can try to develop more
> intelligent ways going forward, obviously :)

Maybe just check the last X runs, instead of only the last one?

Maybe it would also be enough to just mark such tests as WARN with a
"was failing before" note (+ a failure percentage)?
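
Something like this is what I have in mind (a rough sketch with
invented names, not the NIPA code):

# Sketch only: downgrade a failure to WARN when the test was already
# failing in recent reference runs, and attach the failure rate.
def verdict(failed_now, past_failures):
    """past_failures: pass/fail outcomes of the last X runs
    (True = the test failed in that run)."""
    if not failed_now:
        return "PASS"
    if past_failures and any(past_failures):
        rate = 100 * sum(past_failures) // len(past_failures)
        return "WARN (was failing before: %d%% of last %d runs)" % (
            rate, len(past_failures))
    return "FAIL"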

>>> I was seeing a lot of:
>>> https://netdev-2.bots.linux.dev/vmksft-net-mp/results/431181/vm-crash-thr0-2
>>>
>>> So I hacked up this function to filter the crash from NIPA CI:
>>> https://github.com/kuba-moo/nipa/blob/master/contest/remote/lib/vm.py#L50
>>> It tries to get the first 5 function names from the stack, to form
>>> a "fingerprint". But I seem to recall a discussion at LPC's testing
>>> track that there are existing solutions for generating fingerprints.
>>> Are you aware of any?
>>
>> No, sorry. But I guess they are using that with syzkaller, no?
>>
>> I have to admit that crashes (or warnings) are quite rare, so there
>> was no need for automation there. But if it is easy to generate a
>> fingerprint, I would be interested as well: it can help with the
>> tracking, e.g. to find occurrences of crashes/warnings that are very
>> hard to reproduce.
>
> Indeed, I'll keep my ear to the ground. I believe it was discussed in
> relation to KCIDB.

Maybe it would be good to ask the people behind Syzkaller; they must
have something in place :)
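
In the meantime, here is roughly how I picture such a fingerprint,
similar to what you described (only a sketch of the idea, not the
actual NIPA code; it assumes the usual "Call Trace:" dmesg output):

import re

# Sketch only: extract the first few function names of a kernel
# "Call Trace:" section to build a crash fingerprint. The "+0x1b/0x80"
# offsets change between builds, so only the symbol names are kept;
# "?" (unreliable) frames and markers like <IRQ>/<TASK> are skipped.
FRAME_RE = re.compile(r'^\s*(?:\[[^]]*\]\s*)?(\??)\s*([A-Za-z_][\w.]*)\+0x')

def crash_fingerprint(log, depth=5):
    funcs = []
    in_trace = False
    for line in log.splitlines():
        if 'Call Trace:' in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        if line.strip().startswith('<'):    # <IRQ>, <TASK>, ...
            continue
        m = FRAME_RE.match(line)
        if not m:
            break                           # end of the trace section
        if m.group(1):                      # "? func": unreliable frame
            continue
        funcs.append(m.group(2))
        if len(funcs) == depth:
            break
    return ':'.join(funcs)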

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.