Re: [PATCH] KVM: x86: Allow XSAVES on CPUs where host doesn't use it due to an errata

From: Maciej S. Szmigiero
Date: Thu Nov 30 2023 - 17:01:03 EST


On 30.11.2023 18:24, Maxim Levitsky wrote:
On Mon, 2023-11-27 at 09:24 -0800, Sean Christopherson wrote:
On Thu, Nov 23, 2023, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigiero@xxxxxxxxxx>

Since commit b0563468eeac ("x86/CPU/AMD: Disable XSAVES on AMD family 0x17")
the kernel unconditionally clears the XSAVES CPU feature bit on Zen1/2 CPUs.
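
For illustration only, the family check implied by that commit title can be
sketched from the standard CPUID leaf 1 EAX encoding. This is a minimal
sketch, not the kernel's code: both helper names are hypothetical, and the
vendor check is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode the display family from CPUID leaf 1 EAX: the extended family
 * (bits 27:20) is added only when the base family (bits 11:8) is 0xF.
 */
static inline unsigned int cpuid_display_family(uint32_t cpuid_1_eax)
{
	unsigned int base = (cpuid_1_eax >> 8) & 0xf;
	unsigned int ext = (cpuid_1_eax >> 20) & 0xff;

	return base == 0xf ? base + ext : base;
}

/*
 * Hypothetical helper: true for family 0x17, the family named in the
 * commit title. A real check would also confirm the vendor is AMD.
 */
static inline bool is_family_17h(uint32_t cpuid_1_eax)
{
	return cpuid_display_family(cpuid_1_eax) == 0x17;
}
```

A guest can only make this decision sensibly if the family value it reads
matches the real hardware.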

Since KVM CPU caps are initialized from the kernel boot CPU features, this
makes the XSAVES feature also unavailable for KVM guests in this case, even
though they might want to decide on their own whether they are affected by
this erratum.

Allow KVM guests to make such a decision by setting the XSAVES KVM CPU
capability bit based on the actual CPU capability.

This is not generally safe, as the guest can make such a decision if and only if
the Family/Model/Stepping information is reasonably accurate.

Another thing that really worries me is that the XSAVES erratum is a really nasty one -
AFAIK it silently corrupts some registers.

It's not unconditional state corruption, but corruption under a specific set
of conditions, all of which have to be true for it to occur [1]:
* All XMM registers were restored to the initialization value by the most
recent XRSTORS instruction because the XSTATE_BV[SSE] bit was clear.
* The state save area for the XMM registers does not contain the
initialization state.
* The values in the XMM registers match the initialization value when the
XSAVES instruction is executed.
* The MXCSR register has been modified to a value different from the
initialization value since the most recent XRSTORS instruction.
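
As a rough model of the list above, the erratum needs all four conditions at
once. The struct and function below are hypothetical names made up for this
sketch, not something taken from [1] or from the kernel:

```c
#include <stdbool.h>

/* Hypothetical model of the four conditions listed in [1]. */
struct xsaves_erratum_conds {
	/* XMM registers were set to init state by the most recent XRSTORS
	 * because XSTATE_BV[SSE] was clear. */
	bool xmm_init_by_last_xrstors;
	/* The XMM part of the state save area does not hold the init state. */
	bool save_area_xmm_not_init;
	/* XMM registers match the init value when XSAVES executes. */
	bool xmm_at_init_on_xsaves;
	/* MXCSR was changed to a non-init value since the last XRSTORS. */
	bool mxcsr_modified_since_xrstors;
};

/* Corruption is only possible when every condition holds. */
static inline bool xsaves_erratum_can_trigger(const struct xsaves_erratum_conds *c)
{
	return c->xmm_init_by_last_xrstors &&
	       c->save_area_xmm_not_init &&
	       c->xmm_at_init_on_xsaves &&
	       c->mxcsr_modified_since_xrstors;
}
```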

According to [2] this issue was fixed in the microcode update released on
2022-08-09.
[2] also says it is not present anymore in (at least) version 0x08301055.

Is it better to let a broken CPU boot a broken OS (OS which demands XSAVES blindly),
and let a silent data corruption happen than refuse to boot it completely?

It is possible that, for example, Windows only uses a safe subset of this
instruction or just verifies its presence but doesn't actually use it -
it's Hyper-V (L1) that throws this HV_STATUS_CPUID_XSAVE_FEATURE_VALIDATION_ERROR
but I presume it's the Windows (L2) kernel which chooses which XSAVE-family
variant to actually use.

At least in the Linux guest case the guest won't use XSAVES anyway due
to this erratum.

For other guests we also don't make the situation any worse than on bare metal
- if they would use XSAVES anyway in KVM, they would do so when running on
bare metal too.

I mean I understand that it is technically the OS's fault in this case (assuming that we
do provide it the correct CPU family info), but still this seems like the wrong thing to do.

I guess this is one of those few cases when it makes sense for the userspace to
override KVM's CPUID caps and force a feature - in this case at least that
won't be KVM's fault.

I am not against making the decision in QEMU instead of doing this in KVM,
but as I said to Sean it looks like this will still require some KVM
changes since KVM seems to make various decisions depending on the presence
of the XSAVES bit in KVM caps and boot_cpu_has(XSAVES) rather than exclusively
based on what the VMM has set in CPUID.

That's why some KVM changes will be necessary even if the actual decision
logic will be in QEMU.

My point is to make this work out of the box for QEMU "-cpu host" and
similar CPU models that have support for XSAVES.

The reason for this is simple:
It's not like such Windows guests throw a big error screen saying
"Please enable XSAVES or disable XSAVEC for successful boot".

Instead, they simply hang at boot leaving the user wondering what could
be wrong.
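
To make the failing combination concrete, here is a minimal sketch that
flags a CPUID.(EAX=0DH, ECX=1):EAX value advertising XSAVEC without XSAVES.
That this exact combination is what the guest rejects is an assumption read
off the hypothetical error text above; the macro and function names are
made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in CPUID.(EAX=0DH, ECX=1):EAX (names are illustrative). */
#define XSTATE_LEAF_XSAVEOPT	(1u << 0)
#define XSTATE_LEAF_XSAVEC	(1u << 1)
#define XSTATE_LEAF_XSAVES	(1u << 3)

/*
 * Assumption: a guest that validates XSAVE features may refuse a CPU
 * model that exposes XSAVEC but hides XSAVES.
 */
static inline bool xsavec_without_xsaves(uint32_t cpuid_d_1_eax)
{
	return (cpuid_d_1_eax & XSTATE_LEAF_XSAVEC) &&
	       !(cpuid_d_1_eax & XSTATE_LEAF_XSAVES);
}
```

With XSAVES cleared by the host, a "-cpu host" guest would see a value for
which this returns true.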

Users can get very frustrated with a situation like this since they don't
know what to do - on the Intel side of things, look for example at how people
are unable to boot recent Windows versions (both server and client) on
Intel Core 12th gen or later in KVM and how people still try random
things to fix it [3] (it's Hyper-V being picky about extended topology
information in CPUID btw).

Cloud providers often know which guest type is going to be launched in
a VM so there's no problem adding an extra QEMU "-cpu" flag for
a particular cloud VM.

But for individual users, having to guess which magic flags are necessary
to make a particular KVM guest work makes for a miserable user experience.

I think that if a particular guest would work on bare metal it should
work on "-cpu host" too - no tinkering should be required for such
basic functionality as being able to successfully finish booting.


Best regards,
Maxim Levitsky

Thanks,
Maciej

[1]: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/revision-guides/56323-PUB_1_01.pdf
[2]: https://google.github.io/security-research/pocs/cpus/errata/amd/1386/
[3]: https://lore.kernel.org/kvm/MN2PR12MB3023F67FF37889AB3E8885F2A0849@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
https://bugzilla.kernel.org/show_bug.cgi?id=217307
https://forums.unraid.net/topic/131838-windows-11-virtual-machine-platform-wsl2-boot-loop/