Re: [PATCH v5 03/26] x86/hyperv: Update 'struct hv_enlightened_vmcs' definition

From: Vitaly Kuznetsov
Date: Mon Aug 22 2022 - 13:46:52 EST


Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@xxxxxxxxxx> writes:
>>
>> > On Mon, Aug 22, 2022, Vitaly Kuznetsov wrote:
>> >> So I reached out to Microsoft and their answer was that for all these new
>> >> eVMCS fields (including *PerfGlobalCtrl) observing architectural VMX
>> >> MSRs should be enough. *PerfGlobalCtrl case is special because of Win11
>> >> bug (if we expose the feature in VMX feature MSRs but don't set
>> >> CPUID.0x4000000A.EBX BIT(0) it just doesn't boot).
>> >
>> > I.e. TSC_SCALING shouldn't be gated on the flag? If so, then the 2-D array approach
>> > is overkill since (a) the CPUID flag only controls PERF_GLOBAL_CTRL and (b) we aren't
>> > expecting any more flags in the future.
>> >
>>
>> Unfortunately, we have to gate the presence of these new features on
>> something, otherwise VMM has no way to specify which particular eVMCS
>> "revision" it wants (TL;DR: we will break migration).
>>
>> My initial implementation was inventing 'eVMCS revision' concept:
>> https://lore.kernel.org/kvm/20220629150625.238286-7-vkuznets@xxxxxxxxxx/
>>
>> which is needed if we don't gate all these new fields on CPUID.0x4000000A.EBX BIT(0).
>>
>> Going forward, we will still (likely) need something when new fields show up.
>
> My comments from that thread still apply. Adding "revisions" or feature flags
> isn't maintainable, e.g. at best KVM will end up with a ridiculous number of flags.
>
> Looking at QEMU, which I strongly suspect is the only VMM that enables
> KVM_CAP_HYPERV_ENLIGHTENED_VMCS, it does the sane thing of enabling the capability
> before grabbing the VMX MSRs.
>
> So, why not simply apply filtering for host accesses as well?

(I understand that using QEMU to justify KVM's behavior is flawed but...)

QEMU's migration depends on the assumption that identical QEMU command
lines create identical (from guest PoV) configurations. Assume we have
(simplified)

"-cpu CascadeLake-Server,hv-evmcs"

on both source and destination but source host is newer, i.e. its KVM
knows about TSC Scaling in eVMCS and destination host has no idea about
it. If we just apply filtering upon vCPU creation, guest visible MSR
values are going to be different, right? Ok, assuming QEMU also migrates
VMX feature MSRs (TODO: check if that's true), we will be able to fail
migration late (which is already much worse than not being able to
create the desired configuration on destination, 'fail early') if we use
in-KVM filtering to throw an error to userspace. But if we blindly
filter control MSRs on the destination, 'TscScaling' will just disappear
underneath the guest. This is unlikely to work.
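
To make the two outcomes concrete, here is a minimal sketch in plain C
(not KVM code). It assumes the usual layout of a VMX control capability
MSR, with the allowed-1 settings in the upper 32 bits, and uses bit 25 of
the secondary processor-based controls for TSC scaling. The helper names
are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 25 of the secondary processor-based controls: "use TSC scaling". */
#define SECONDARY_EXEC_TSC_SCALING (UINT32_C(1) << 25)

/* Allowed-1 settings live in the upper 32 bits of a VMX control MSR. */
static inline uint32_t vmx_allowed1(uint64_t msr)
{
	return (uint32_t)(msr >> 32);
}

/*
 * Hypothetical check for the "throw an error" variant: the destination
 * is acceptable only if it offers every control the source offered.
 */
static inline bool vmx_ctrl_msr_compatible(uint64_t src_msr, uint64_t dst_msr)
{
	return (vmx_allowed1(src_msr) & ~vmx_allowed1(dst_msr)) == 0;
}

/*
 * Hypothetical model of the "blind filtering" variant: controls the
 * destination does not know about are silently dropped, so the value
 * the guest reads after migration differs from what it read before.
 */
static inline uint64_t vmx_ctrl_msr_blind_filter(uint64_t src_msr,
						 uint64_t dst_msr)
{
	uint64_t allowed1 = vmx_allowed1(src_msr) & vmx_allowed1(dst_msr);

	return (allowed1 << 32) | (src_msr & UINT64_C(0xffffffff));
}
```

With a source MSR that advertises SECONDARY_EXEC_TSC_SCALING and a
destination that does not, vmx_ctrl_msr_compatible() returns false (the
late failure), while vmx_ctrl_msr_blind_filter() returns a value with the
bit cleared (the feature disappearing underneath the guest).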

In any case, what we need, is an option for VMM (read: QEMU) to create
the configuration with 'TscScaling' filtered out even if KVM supports the
bit in eVMCS. This way the guest will be able to migrate backwards to an
older KVM which doesn't support it, i.e.

'-cpu CascadeLake-Server,hv-evmcs'
creates the 'original' eVMCS configuration, no TscScaling

'-cpu CascadeLake-Server,hv-evmcs,hv-evmcs-2022' creates the updated one.

KVM_CAP_HYPERV_ENLIGHTENED_VMCS is bad as it only takes 'eVMCS' version
as a parameter (as we assumed it will always change when new fields are
added, but that turned out to be false). That's why I suggested
KVM_CAP_HYPERV_ENLIGHTENED_VMCS2.

For the issue at hand, 'hv-evmcs-2022' can just set CPUID.0x4000000A.EBX
BIT(0) and then we gate all new fields' existence on it. It doesn't
matter much if we filter host accesses or not in this scheme.
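
A rough sketch of that gating, in plain C rather than actual KVM code:
the new controls are kept in the allowed-1 mask only when the guest's
CPUID.0x4000000A.EBX has BIT(0) set. The control bit positions follow the
architectural VMX definitions; the enum, the flag macro name and the
helpers are hypothetical, and only the fields named in this thread
(TscScaling and *PerfGlobalCtrl) are covered.

```c
#include <stdint.h>

/* Architectural VMX control bits for the features discussed here. */
#define SECONDARY_EXEC_TSC_SCALING          (UINT32_C(1) << 25)
#define VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL  (UINT32_C(1) << 12)
#define VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL (UINT32_C(1) << 13)

/* CPUID.0x4000000A.EBX BIT(0); the macro name is made up for this sketch. */
#define HV_NESTED_EBX_NEW_EVMCS_FIELDS      (UINT32_C(1) << 0)

/* Hypothetical selector for the control word being filtered. */
enum evmcs_ctrl_field {
	EVMCS_2NDEXEC,
	EVMCS_EXIT,
	EVMCS_ENTRY,
};

/* Controls that did not exist in the original eVMCS definition. */
static inline uint32_t evmcs_new_ctrls(enum evmcs_ctrl_field field)
{
	switch (field) {
	case EVMCS_2NDEXEC:
		return SECONDARY_EXEC_TSC_SCALING;
	case EVMCS_EXIT:
		return VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
	case EVMCS_ENTRY:
		return VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
	}
	return 0;
}

/* Hide the new controls unless the VMM set the CPUID flag for this guest. */
static inline uint32_t evmcs_filter_allowed1(enum evmcs_ctrl_field field,
					     uint32_t allowed1,
					     uint32_t guest_cpuid_ebx)
{
	if (guest_cpuid_ebx & HV_NESTED_EBX_NEW_EVMCS_FIELDS)
		return allowed1;

	return allowed1 & ~evmcs_new_ctrls(field);
}
```

In this model plain 'hv-evmcs' leaves the flag clear, so the result is
the same regardless of how new the host KVM is, and 'hv-evmcs-2022' sets
the flag to opt in to the updated configuration.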

Going all the way back, I'd certainly have made the filtering apply to host
writes, throwing an error when eVMCS is enabled (and I'd have made it per-VM
and mandated that it is enabled prior to getting MSRs), but that doesn't
seem to help us much now.

--
Vitaly