Re: Kernel regression tracking/reporting initiatives and KCIDB

From: Ricardo Cañuelo
Date: Fri Aug 18 2023 - 03:53:19 EST


Hi,

On Thu, Aug 17 2023 at 15:32:21, Guillaume Tucker <guillaume.tucker@xxxxxxxxxxxxx> wrote:
> With the new API, data is owned by the users who submit it so we can
> effectively provide a solution for grouping data from multiple CI
> systems like KCIDB does.
>
> The key thing here is that KernelCI as a project will be
> providing a database with regression information collected from
> any public CI system.

Does this mean that KernelCI will replace KCIDB? Or will they both keep
working separately?

> So the topic of tracking regressions for the whole kernel is already
> part of the roadmap for KernelCI, and if just waiting for CI systems
> to push data is not enough we can have services that actively go and
> look for regressions to feed them into the database under a particular
> category (or user).
> It would be good to align ideas you may have with KernelCI's
> plans

Our ideas start by studying the required features and needs for
regression analysis, reporting and tracking in a general and
system-agnostic way: first the concepts, then the implementation. I
think that analyzing the problem from the specific perspective of
KernelCI (or any other CI system in particular) would constrain the
design from the start. If we begin with a general approach we can
always specialize it later for a particular implementation, but
starting with a restricted design in mind, tailored to a specific
system, will probably tie it to that system permanently.

IMO the work we want to do with regressions should be higher-level,
based on the data produced by a CI system (any of them) and not
dependent on any particular implementation.
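
As a rough illustration of what I mean by "higher-level", a regression
record could be expressed in CI-agnostic terms, along these lines (a
minimal sketch in Python; all field names are hypothetical and not
taken from KCIDB or any other existing schema):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Regression:
        """CI-agnostic regression record: which test started failing,
        between which kernel revisions, reported by which CI system."""
        test_path: str             # e.g. a kselftest path
        last_pass_commit: str      # last revision where the test passed
        first_fail_commit: str     # first revision where it failed
        origin: str                # CI system that produced the raw data
        arch: Optional[str] = None
        config: Optional[str] = None
        platform: Optional[str] = None      # board / hardware target
        reported_by: list = field(default_factory=list)  # report URLs
        bisected_to: Optional[str] = None   # culprit commit, if any

Any CI system that can map its results onto something like this could
feed the same tracking and reporting machinery.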

> also please take into account the fact that the current
> Regression tracker you've created relies on the legacy system
> which is going to be retired in the coming months.

That's correct. The regression tracker started as a proof of concept to
explore ideas and we based it on KernelCI test data. We're aware that
the legacy system will be retired soon, that's why we want to look into
KCIDB as a data source.

>> - did this test also fail on other hardware targets or with other kernel
>> configurations?
>> - is it possible that the test failed because of an infrastructure
>> error?
>
> This should be treated as a false-positive failing test rather
> than a "regression". But yes of course we need to deal with
> them, it's just slightly off-topic here I think.

Not regressions, that's right, but I don't think these should be simply
categorized as false positives. If we treated these two particular cases
as false positives we would be hiding and missing important results:

- If the same test case on the same kernel version failed with different
  configurations or on other boards, highlighting that information could
  help narrow down the investigation or point it in the right
  direction. There's definitely a failure (probably not a regression),
  but the thing to fix might not be a kernel code commit but the
  configuration used for the test. This can be reported to the test
  authors or to the maintainers of the CI system running the test.

- If the test failed because of an infrastructure error, that's
  something that can be reported to the lab maintainers to fix. This can
  be done automatically (see the sketch below).
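
To make that concrete, here is a minimal sketch of how those two checks
could be automated on top of aggregated results. Everything below is
hypothetical: the result fields and the infrastructure-error heuristic
are illustrative, not taken from any real CI system:

    # Hypothetical aggregated test results, one dict per run.
    results = [
        {"test": "kselftest.net", "kernel": "v6.5-rc6",
         "config": "defconfig", "board": "rk3399-gru-kevin",
         "status": "FAIL", "log": "..."},
        {"test": "kselftest.net", "kernel": "v6.5-rc6",
         "config": "allmodconfig", "board": "qemu-x86_64",
         "status": "FAIL", "log": "..."},
    ]

    def related_failures(failure, results):
        """Other failures of the same test on the same kernel version,
        but with a different configuration or on a different board."""
        return [r for r in results
                if r["test"] == failure["test"]
                and r["kernel"] == failure["kernel"]
                and r["status"] == "FAIL"
                and (r["config"] != failure["config"]
                     or r["board"] != failure["board"])]

    # Naive heuristic: an infrastructure error leaves lab/runtime error
    # markers in the log rather than a genuine test failure.
    INFRA_MARKERS = ("timed out waiting for device",
                     "failed to download", "power cycle failed")

    def looks_like_infra_error(failure):
        return any(m in failure["log"] for m in INFRA_MARKERS)

The first check could annotate the report sent to test authors or CI
maintainers; the second could file a ticket with the lab automatically.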

>> - does the test fail consistently since that commit or does it show
>> unstable results?
>> - does the test output show any traces of already known bugs?
>> - has this regression been bisected and reported anywhere?
>> - was the regression reported by anyone? If so, is there someone already
>> working on it?
>
> These are all part of the post-regression checks we've been
> discussing to run as part of KernelCI. Basically, extending from
> the current automated bisection jobs we have and also taking into
> account the notion of dynamic scheduling. However, when
> collecting data from other CI systems I don't think there is much
> we can do if the data is not there. But we might be able to
> create collaborations to run extra post-regression checks in
> other CI systems to tackle this.

This is why I think handling this at a higher level, once all the test
data from multiple CI systems has been collected, could be the right
strategy. Can't these post-regression checks be applied to a common DB
with results aggregated from different CI systems? As long as the
results are collected in a common and standard way, I mean. We could
have those checks implemented only once, in a centralized and generic
way, instead of having a different implementation of the same process in
each of the data sources.
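
For instance, a check like "does this test fail consistently since that
commit?" could be implemented once against the aggregated store, no
matter which CI system the rows came from. A minimal sketch, using
SQLite as a stand-in for the common DB and an invented "results" table
layout (origin, test, kernel, status):

    import sqlite3

    db = sqlite3.connect("aggregated-results.db")

    def failure_history(db, test, kernels):
        """Status of one test across recent kernel revisions, counting
        results from every CI system (origin) that submitted data."""
        placeholders = ",".join("?" * len(kernels))
        return db.execute(
            "SELECT kernel, origin, status FROM results"
            " WHERE test = ? AND kernel IN (" + placeholders + ")",
            [test, *kernels]).fetchall()

    def fails_consistently(rows):
        """True if every recorded run failed; mixed PASS/FAIL results
        point to an unstable test rather than a clean regression."""
        return {status for _, _, status in rows} == {"FAIL"}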

> Experimenting with KCIDB now may be interesting, but depending on
> the outcome of the discussions around having one central database
> for KernelCI it might not be the optimal way to do it.

Why not? Sorry, I might not have the full context. Can you or Nikolai
give a bit more insight into the possible future status of KCIDB and
KernelCI, and the relationship between them?

Thanks,
Ricardo