Re: [BUG] Guest OSes die simultaneously (bisected)

From: Paolo Bonzini
Date: Thu Jan 04 2024 - 11:33:14 EST


On 1/4/24 17:06, Paul E. McKenney wrote:
Although I am happy to have been able to locate the commit (and even
happier that Sean spotted the problem and that you quickly pushed the
fix to mainline!), chasing this consumed a lot of time and systems over
an embarrassingly large number of months. As in I first spotted this
bug in late July. Despite a number of increasingly complex attempts,
bisection became feasible only after the buggy commit was backported to
our internal v5.19 code base. 🙁

Yes, this strikes two sore points.

One is that I have also experienced being able to bisect only with a somewhat more linear history (namely the CentOS Stream 9 aka c9s frankenkernel [1]) and not with upstream. Even if the c9s kernel is not a fully linear set of commits, there's some benefit from merge commits that consist of slightly more curated set of patches, where each merge commit includes both new features and bugfixes. Unfortunately, whether you'll be able to do this with the c9s kernel depends a lot on the subsystems involved and on the bug. Both are factors that may or may not be known in advance.

The other, of course, is testing. The KVM selftests infrastructure is meant for this kind of white box problem, but the space of tests that can be written is so large, that there's always too few tests. It shines when you have a clear bisection but an unclear fix (in the past I have had cases where spending two days to write a test led me to writing a fix in thirty minutes), but boosting the reproducibility is always a good thing.

And please understand that I am not casting shade on those who wrote,
reviewed, and committed that buggy commit. As in I freely confess that
I had to stare at Sean's fix for a few minutes before I figured out what
was going on.

Oh don't worry about that---rather, I am going to cast a shade on those that did not review the commit, namely me. I am somewhat obsessed with Boolean logic and *probably* I would have caught it, or would have asked to split the use of designated initializers to a separate patch. Any of the two could, at least potentially, have saved you quite some time.

Instead, the point I am trying to make is that carefully
constructed tests can serve as tireless and accurate code reviewers.
This won't ever replace actual code review, but my experience indicates
that it will help find more bugs more quickly and more easily.

TBH this (conflict between virtual addresses on the host and the guest leading to corruption of the guest) is probably not the kind of adversarial test that one would have written or suggested right off the bat. But it should be written now indeed.

Paolo

[1] https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/