Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests

From: Radim KrÄmÃÅ
Date: Thu Mar 16 2017 - 10:08:25 EST


2017-03-16 09:24-0400, Gabriel L. Somlo:
> On Thu, Mar 16, 2017 at 01:41:28AM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 15, 2017 at 07:35:34PM -0400, Gabriel L. Somlo wrote:
> > > On Wed, Mar 15, 2017 at 11:22:18PM +0200, Michael S. Tsirkin wrote:
> > > > Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem:
> > > > unless explicitly provided with kernel command line argument
> > > > "idlehalt=0" they'd implicitly assume MONITOR and MWAIT availability,
> > > > without checking CPUID.
> > > >
> > > > We currently emulate that as a NOP but on VMX we can do better: let
> > > > guest stop the CPU until timer, IPI or memory change. CPU will be busy
> > > > but that isn't any worse than a NOP emulation.
> > > >
> > > > Note that mwait within guests is not the same as on real hardware
> > > > because halt causes an exit while mwait doesn't. For this reason it
> > > > might not be a good idea to use the regular MWAIT flag in CPUID to
> > > > signal this capability. Add a flag in the hypervisor leaf instead.
> > > >
> > > > Additionally, we add a capability for QEMU - e.g. if it knows there's an
> > > > isolated CPU dedicated for the VCPU it can set the standard MWAIT flag
> > > > to improve guest behaviour.
> > >
> > > Same behavior (on the mac pro 1,1 running F22 with custom-compiled
> > > kernel from kvm git master, plus this patch on top).
> > >
> > > The OS X 10.7 kernel hangs (or at least progresses extremely slowly)
> > > on boot, does not bring up guest graphical interface within the first
> > > 10 minutes that I waited for it. That, in contrast with the default
> > > nop-based emulation where the guest comes up within 30 seconds.
> >
> >
> > Thanks a lot, meanwhile I'll try to write a unit-test and experiment
> > with various behaviours.
> >
> > > I will run another round of tests on a newer Mac (4-year-old macbook
> > > air) and report back tomorrow.
> > >
> > > Going off on a tangent, why would encouraging otherwise well-behaved
> > > guests (like linux ones, for example) to use MWAIT be desirable to
> > > begin with ? Is it a matter of minimizing the overhead associated with
> > > exiting and re-entering L1 ? Because if so, AFAIR staying inside L1 and
> > > running guest-mode MWAIT in a tight loop will actually waste the host
> > > CPU without the opportunity to yield to some other L0 thread. Sorry if
> > > I fell into the middle of an ongoing conversation on this and missed
> > > most of the relevant context, in which case please feel free to ignore
> > > me... :)
> > >
> > > Thanks,
> > > --G
> >
> > It's just some experiments I'm running, I'm not ready to describe it
> > yet. I thought this part might be useful to at least some guests, so
> > trying to upstream it right now.
>
> OK, so on a macbook air running F25 and the latest kvm git master plus
> your v5 patch (4.11.0-rc2+), things appear to work.
>
> host-side cpuid output:
> eax=0x000040 ebx=0x000040 ecx=0x000003 edx=0x021120
>
> guest-side cpuid output:
> eax=00000000 ebx=00000000 ecx=0x000003 edx=00000000
>
> processor : 3
> vendor_id : GenuineIntel
> cpu family : 6
> model : 42
> model name : Intel(R) Core(TM) i7-2677M CPU @ 1.80GHz
> stepping : 7
> microcode : 0x29
> cpu MHz : 1157.849
> cache size : 4096 KB
> physical id : 0
> siblings : 4
> core id : 1
> cpu cores : 2
> apicid : 3
> initial apicid : 3
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
> bugs :
> bogomips : 3604.68
> clflush size : 64
> cache_alignment : 64
> address sizes : 36 bits physical, 48 bits virtual
> power management:
>
> After studying your patch a bit more carefully (sorry, it's crazy
> around here right now :) ) I realized you're simply trying to
> (selectively) decide when to exit L1 and emulate as NOP vs. when to
> just allow L1 to execute MONITOR & MWAIT natively.
>
> Is that right ? Because if so, the issues I saw on my MacPro1,1 are
> weird and inexplicable, given that allowing L>=1 to run MONITOR/MWAIT
> natively was one of the options Alex Graf and Rene Rebe used back in
> the very early days of OS X on QEMU, at the time I got involved with
> that project. Here's part of an out of tree patch against 3.4 which did
> just that, and worked as far as I remember on *any* MWAIT capable
> intel chip I had access to back in 2010:
>
> ##############################################################################
> # 99-mwait.patch.kvm-kmod (Rene Rebe <rene@xxxxxxxxxxxx>) 2010-04-27
> ##############################################################################
> diff -pNarU5 linux-3.4/arch/x86/kvm/cpuid.c linux-3.4-mac/arch/x86/kvm/cpuid.c
> --- linux-3.4/arch/x86/kvm/cpuid.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/cpuid.c 2012-10-09 11:42:59.921215750 -0400
> @@ -222,11 +222,11 @@ static int do_cpuid_ent(struct kvm_cpuid
> f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
> F(FXSR) | F(FXSR_OPT) | f_gbpages | f_rdtscp |
> 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
> /* cpuid 1.ecx */
> const u32 kvm_supported_word4_x86_features =
> - F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> + F(XMM3) | F(PCLMULQDQ) | F(MWAIT) /* DTES64, MONITOR */ |
> 0 /* DS-CPL, VMX, SMX, EST */ |
> 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> 0 /* Reserved, DCA */ | F(XMM4_1) |
> F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> diff -pNarU5 linux-3.4/arch/x86/kvm/svm.c linux-3.4-mac/arch/x86/kvm/svm.c
> --- linux-3.4/arch/x86/kvm/svm.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/svm.c 2012-10-09 11:44:41.598997481 -0400
> @@ -1102,12 +1102,10 @@ static void init_vmcb(struct vcpu_svm *s
> set_intercept(svm, INTERCEPT_VMSAVE);
> set_intercept(svm, INTERCEPT_STGI);
> set_intercept(svm, INTERCEPT_CLGI);
> set_intercept(svm, INTERCEPT_SKINIT);
> set_intercept(svm, INTERCEPT_WBINVD);
> - set_intercept(svm, INTERCEPT_MONITOR);
> - set_intercept(svm, INTERCEPT_MWAIT);
> set_intercept(svm, INTERCEPT_XSETBV);
>
> control->iopm_base_pa = iopm_base;
> control->msrpm_base_pa = __pa(svm->msrpm);
> control->int_ctl = V_INTR_MASKING_MASK;
> diff -pNarU5 linux-3.4/arch/x86/kvm/vmx.c linux-3.4-mac/arch/x86/kvm/vmx.c
> --- linux-3.4/arch/x86/kvm/vmx.c 2012-05-20 18:29:13.000000000 -0400
> +++ linux-3.4-mac/arch/x86/kvm/vmx.c 2012-10-09 11:42:59.925215977 -0400
> @@ -1938,11 +1938,11 @@ static __init void nested_vmx_setup_ctls
> nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
> nested_vmx_procbased_ctls_low = 0;
> nested_vmx_procbased_ctls_high &=
> CPU_BASED_VIRTUAL_INTR_PENDING | CPU_BASED_USE_TSC_OFFSETING |
> CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING |
> - CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
> + CPU_BASED_CR3_LOAD_EXITING |
> CPU_BASED_CR3_STORE_EXITING |
> #ifdef CONFIG_X86_64
> CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
> #endif
> CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> @@ -2404,12 +2404,10 @@ static __init int setup_vmcs_config(stru
> CPU_BASED_CR3_LOAD_EXITING |
> CPU_BASED_CR3_STORE_EXITING |
> CPU_BASED_USE_IO_BITMAPS |
> CPU_BASED_MOV_DR_EXITING |
> CPU_BASED_USE_TSC_OFFSETING |
> - CPU_BASED_MWAIT_EXITING |
> - CPU_BASED_MONITOR_EXITING |
> CPU_BASED_INVLPG_EXITING |
> CPU_BASED_RDPMC_EXITING;
>
> opt = CPU_BASED_TPR_SHADOW |
> CPU_BASED_USE_MSR_BITMAPS |
>
> If all you're trying to do is (selectively) revert to this behavior,
> that "shouldn't" mess it up for the MacPro either, so I'm thoroughly
> confused at this point :)
>
> Back in 2010, running MWAIT in L>=1 behaved 100% exactly like a NOP,
> didn't power down the physical CPU, just immediately moved on to the
> next instruction. As such, there was no power saving and no
> opportunity to yield to another L0 thread either, unlike with NOP
> emulation at L0.
>
> Did that change on newer Intel chips (i.e., is guest-mode MWAIT now
> doing something smarter than just acting as a guest-mode NOP) ?

Probably, MWAIT in new intel chips enters power saving mode normally.

If hardware-executed MWAIT acted as a NOP in your old chip, then that
shouldn't be a problem either ... Maybe OS X gets confused into doing
something really dumb because we do not expose the MONITOR/MWAIT feature
bit correctly.

Can you try this QEMU patch on the old hardware?

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 7aa762245a54..4b112e12188a 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -2764,10 +2764,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
break;
case 5:
/* mwait info: needed for Core compatibility */
- *eax = 0; /* Smallest monitor-line size in bytes */
- *ebx = 0; /* Largest monitor-line size in bytes */
- *ecx = CPUID_MWAIT_EMX | CPUID_MWAIT_IBE;
- *edx = 0;
+ host_cpuid(index, 0, eax, ebx, ecx, edx);
break;
case 6:
/* Thermal and Power Leaf */
diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 55865dbee0aa..1eb78291b093 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -360,6 +360,7 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
if (!kvm_irqchip_in_kernel()) {
ret &= ~CPUID_EXT_X2APIC;
}
+ ret |= CPUID_EXT_MONITOR;
} else if (function == 6 && reg == R_EAX) {
ret |= CPUID_6_EAX_ARAT; /* safe to allow because of emulated APIC */
} else if (function == 7 && index == 0 && reg == R_EBX) {


Thanks.