Re: stable/LTS test report from KernelCI (2023-12-08)

From: Guillaume Tucker
Date: Mon Dec 11 2023 - 05:14:14 EST


On 08/12/2023 16:58, Greg KH wrote:
> On Fri, Dec 08, 2023 at 12:29:35PM -0300, Gustavo Padovan wrote:
>> Hello,
>>
>> As discussed with Greg at LPC, we are starting an iterative process to
>> deliver meaningful stable test reports from KernelCI. As Greg pointed out,
>> he doesn't look at the current reports sent automatically from KernelCI.
>> Those are not clean enough to help the stable release process, so we
>> discussed starting over again.
>>
>> This reporting process is a learning exercise, growing over time. We are
>> starting small with data we can verify manually (at first) to make sure we
>> are not introducing noise or reporting flakes and false-positives. The
>> feedback loop will teach us how to filter the results and report with
>> incremental automation of the steps.
>>
>> Today we are starting with build and boot tests (for the hardware platforms
>> in KernelCI with sustained availability over time). Then, at every iteration
>> we try to improve it, increasing the coverage and data visualization.
>> Feedback is really important. Eventually, we will also have this report
>> implemented in the upcoming KernelCI Web Dashboard.
>>
>> This work is a contribution from Collabora (on behalf of its clients)
>> to improve kernel integration as a whole. Moving forward, Shreeya
>> Patel, from the Collabora team, will be taking on the responsibility
>> of delivering these reports.
>>
>> Without further ado, here's our first report:
>>
>>
>> ## stable-rc HEADs:
>>
>> Date: 2023-12-08
>> 6.1: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/log/?h=45deeed0dade29f16e1949365688ea591c20cf2c
>> 5.15: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/log/?h=e5a5d1af708eced93db167ad55998166e9d893e1
>> 5.10: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/log/?h=ce575ec88a51a60900cd0995928711df8258820a
>> 5.4: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/log/?h=f47279cbca2ca9f2bbe1178634053024fd9faff3
>>
>> * 6.6 stable-rc has not been added to KernelCI yet, but we plan to add
>> it next week.
>>
>>
>> ## Build failures:
>>
>> No build failures seen for the stable-rc/queue commit heads for
>> 6.1/5.15/5.10/5.4  \o/
>>
>>
>> ## Boot failures:
>>
>> No **new** boot failures seen for the stable-rc/queue commit heads for
>> 6.1/5.15/5.10/5.4  \o/
>>
>> (for the time being we are leaving existing failures behind)
>>
>>
>> ## Considerations
>>
>> All this data is available in the legacy KernelCI Web Dashboard -
>> https://linux.kernelci.org/ - but not easily filtered there. The data in
>> this report was checked manually. As we evolve this report, we want to add
>> traceability of the information, making it really easy for anyone to dig
>> deeper for more info, logs, etc.
>>
>> The report covers the hardware platforms in KernelCI with sustained
>> availability over time - we will detail this further in future reports.
>>
>> We opted to make the report really simple as you can see above. It is just
>> an initial spark. From here your feedback will drive the process. So really
>> really tell us what you want to see next. We want FEEDBACK!
>
> Looks great!
>
> A few notes, it can be a bit more verbose if you want :)
>
> One email per -rc release (i.e. one per branch) is fine, and that way if
> you add a:
> Tested-by: kernelci-bot <email goes here>
> or something like that, to the email, my systems will pick it up and it
> will get added to the final commit message.
>
> But other than that, hey, I'll take the above, it's better than what was
> there before!
>
> How about if something breaks, what will it look like? That's where it
> gets more "interesting" :)

Brings back some memories, 5.10.20-rc2 :)

https://lore.kernel.org/stable/32a6c609-642c-71cf-0a84-d5e8ccd104b1@xxxxxxxxxxxxx/

I see some people are following in my footsteps now. It will be
interesting to see whether they reach the same conclusions about
how to automate these emails and track regressions. I guess it's
hard to convince others that the solutions we now know we need to
put in place will actually solve this, so everyone has to make
the journey themselves. Maybe that's part of upstream
development, not always removing duplication of effort.


Here's some feedback in general:

* Showing what is passing is mostly noise

As Greg pointed out, what matters is what is broken (i.e. new
regressions). For stable, I think we also established that it is
good to keep a record of everything that was tested and passed,
but that is not very relevant when gating releases. See the other
manual emails sent by Shuah, Guenter and some Linaro folks for
example.

* Replying to the stable review

This email is a detached thread; I know it's a draft and just a
way to discuss things, but obviously a real report would need to
be sent as a reply to the stable-rc patch review thread.

On a related topic, it was once mentioned that since stable
releases occur once a week and they are used as the basis for
many distros and products, it would make sense to have
long-running tests after the release has been declared. So we
could have, say, 48h of testing with extended coverage from LTP,
fstests, benchmarks etc. That would be a reply to the email with
the release tag, not the patch review.

For the record, a few years ago, KernelCI used to reply to the
review threads on the list. Unfortunately this broke at some
point, mostly because the legacy system is too bloated and hard
to maintain, and now it's waiting to be enabled again with the
new API. Here's one example, 4.4-202 in 2019, a bit before it
stopped:

https://lore.kernel.org/stable/5dce97f3.1c69fb81.6633c.685c@xxxxxxxxxxxxx/

* Automation

And also obviously, doing this by hand isn't really practical.
It's OK for a maintainer looking at only a small number of
results, but for KernelCI it would take maybe 2h per stable
release candidate for a dedicated person to look at all the
regressions etc. So discussing the format and type of content is
more relevant at this stage, I think, while the automated data
harvesting part gets implemented in the background. And of
course, we need the new API in production for this to actually be
enabled - so it's still a few months away.

I've mentioned before the concept of finding "2nd derivatives" in
the test results: basically, the first delta gives you all the
regressions, and then you take a delta of the regressions to find
the new ones. Maintainer trees would typically be comparing
against mainline or, say, the -rc2 tag their branch is based on.
In the case of stable, it would be between the stable-rc branch
being tested and the base stable branch with the last tagged
release.


But hey, I'm not a stable maintainer :) This is merely a summary
of what I recall from the past few years of discussions and what
I believe to be the current consensus on what people wanted to do
next.

One last thing: I see there's a change in KernelCI now to
actually stop sending the current (suboptimal) automated reports
to the stable mailing list:

https://github.com/kernelci/kernelci-jenkins/pull/136

Is this actually what people here want? I would argue that we
need the new reports first before deliberately stopping the old
ones. Maybe I missed something; it just felt a bit arbitrary.
Some folks might actually be reading those emails; if we want to
stop them, we should probably first send a warning about when
they'll stop etc. Anyway, I'll go back under my rock for now :)

Cheers,
Guillaume