Re: Kernel panic in netif_rx_internal after v6 pings between netns

From: Matthieu Baerts
Date: Mon Jan 22 2024 - 13:56:31 EST


Hi Jakub,

On 22/01/2024 18:28, Jakub Kicinski wrote:

(...)

> Somewhat related. What do you do currently to ignore crashes?

I was wondering why you wanted to ignore crashes :) ... but then I saw
the new "Test ignored" and "Crashes ignored" sections on the status
page. Just to be sure: you don't want to report issues that have not
been introduced by the new patches, right?

We don't need to do that on MPTCP side:
- either it is a new crash with patches that are in review, and that's
not impacting others → we test each series individually, not a batch of
series.
- or there are issues with recent patches, not in netdev yet → we fix,
or revert.
- or there is an issue elsewhere, like the kernel panic we reported
here: usually I try to quickly apply a workaround, e.g. applying a fix,
or a revert. I don't think we ever had an issue really impacting us
where we couldn't find a quick solution in one or two days. With the
panic we reported here, ~15% of the tests had an issue; it is "OK" to
live with that for a few days/weeks.

With fewer tests and a smaller community, it is easier for us to just
say on the ML and weekly meetings: "this is a known issue, please ignore
for the moment". But if possible, I try to add a workaround/fix in our
repo used by the CI and devs (not upstreamed).

For NIPA CI, do you want to do the same as with the build, and compare
with a reference? Or with multiple ones, to take unstable tests into
account? Or maintain a list of known issues (I think you started to do
that, probably safer/easier for the moment)?
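
To make these options a bit more concrete, here is a minimal Python
sketch of how they could combine. This is not NIPA code: the function
name, its parameters and the idea of representing each run as a set of
failed test names are all assumptions, only for illustration.

```python
def new_failures(current, reference_runs, known_issues=frozenset()):
    """Return the failures that are worth reporting for a new series.

    current:        names of the tests failing with the patches applied
    reference_runs: iterable of sets, one per reference run, each holding
                    the tests that failed without the patches; using
                    several runs lets unstable tests be filtered out
    known_issues:   manually maintained names to ignore (hypothetical list)
    """
    # A test that failed in any reference run is not blamed on the patches.
    seen_before = set().union(*reference_runs)
    return sorted(set(current) - seen_before - set(known_issues))
```

With a single reference run, this is the same idea as the build
comparison; passing more runs is one way to tolerate flaky tests.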

> I was seeing a lot of:
> https://netdev-2.bots.linux.dev/vmksft-net-mp/results/431181/vm-crash-thr0-2
>
> So I hacked up this function to filter the crash from NIPA CI:
> https://github.com/kuba-moo/nipa/blob/master/contest/remote/lib/vm.py#L50
> It tries to get first 5 function names from the stack, to form
> a "fingerprint". But I seem to recall a discussion at LPC's testing
> track that there are existing solutions for generating fingerprints.
> Are you aware of any?

No, sorry. But I guess they are using that with syzkaller, no?

I have to admit that crashes (or warnings) are quite rare, so there was
no need for automation there. But if it is easy to generate a
fingerprint, I would be interested as well: it can help with tracking,
to find occurrences of crashes/warnings that are very hard to reproduce.
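
For reference, the "first 5 function names from the stack" idea could
look roughly like the Python sketch below. It is not a copy of the NIPA
function linked above: the function name, the regex, the ":" separator
and the choice to skip "? func" frames are my own assumptions, and it
only handles traces that contain a "Call Trace:" line.

```python
import re

# Matches a stack frame such as "netif_rx_internal+0x42/0x100".
# Group 1 is set for "? func+0x.../0x..." frames, group 2 is the name.
FRAME_RE = re.compile(r"(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def crash_fingerprint(log, depth=5):
    """Join the first `depth` function names found after "Call Trace:"."""
    names = []
    in_trace = False
    for line in log.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame, e.g. "<TASK>" or a register dump
        if match.group(1):
            continue  # "? func" frames are unreliable, skip them (assumption)
        names.append(match.group(2))
        if len(names) == depth:
            break
    return ":".join(names)
```

Two crashes with the same fingerprint can then be counted as occurrences
of the same issue, at the risk of merging different bugs that share the
same top frames.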

> (FWIW the crash from above seems to be gone on latest linux.git,
> this night's CIs run are crash-free.)

Good it was quickly fixed!

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.