Re: [RFC 00/26] Intel Thread Director Virtualization

From: Zhao Liu
Date: Thu Feb 22 2024 - 02:29:28 EST


Ping Paolo & Sean,

Do you have any comment? Or do you think ITD virtualization is
appropriate to discuss at PUCK?

Thanks,
Zhao

On Sat, Feb 03, 2024 at 05:11:48PM +0800, Zhao Liu wrote:
> Date: Sat, 3 Feb 2024 17:11:48 +0800
> From: Zhao Liu <zhao1.liu@xxxxxxxxxxxxxxx>
> Subject: [RFC 00/26] Intel Thread Director Virtualization
> X-Mailer: git-send-email 2.34.1
>
> From: Zhao Liu <zhao1.liu@xxxxxxxxx>
>
> Hi list,
>
> This is our RFC to virtualize the Intel Thread Director (ITD) feature
> for Guests. It is based on Ricardo's patch series adding ITD-related
> support to the HFI driver ("[PATCH 0/9] thermal: intel: hfi: Prework for
> the virtualization of HFI" [1]).
>
> In short, the purpose of this patch set is to enable the ITD-based
> scheduling logic in Guest so that Guest can better schedule Guest tasks
> on Intel hybrid platforms.
>
> Currently, ITD is necessary for Windows VMs. Based on ITD virtualization
> support, the Windows 11 Guest could have significant performance
> improvement (for example, on i9-13900K, up to 14%+ improvement on
> 3DMARK).
>
> Our ITD virtualization is not bound to the VM's hybrid topology or the
> vCPUs' CPU affinity. However, in our practice, the ITD scheduling
> optimization for Win11 VMs works best when combined with hybrid topology
> and CPU affinity (this is related to the specific implementation of
> Win11 scheduling). For more details, please see Section 1.2 "About
> hybrid topology and vCPU pinning".
>
> To enable ITD-related scheduling optimization in a Win11 VM, some other
> thermal-related support is also needed (HWP, CPPC), but we can emulate
> it with dummy values in the VMM (we'll send out extra patches for these
> in the future).
>
> Your feedback is welcome!
>
>
> 1. Background and Motivation
> ============================
>
> 1.1. Background
> ^^^^^^^^^^^^^^^
>
> Our use case is running games in a client Windows VM as a cloud gaming
> solution.
>
> Gaming VMs are performance-sensitive VMs on Client, so they usually
> have two characteristics to ensure interactivity and performance:
>
> i) The number of vCPUs is equal or close to the number of Host pCPUs.
>
> ii) The vCPUs of Gaming VM are often bound to the pCPUs to achieve
> exclusive resources and avoid the overhead of migration.
>
> In this case, Host can't provide effective scheduling for Guest, so we
> need to deliver more hardware-assisted scheduling capabilities to Guest
> to enhance Guest's scheduling.
>
> Windows 11 (and future Windows products) is heavily optimized for the
> Intel hybrid platform. To get the best performance, we need to
> virtualize hybrid scheduling features (HFI/ITD) for Windows Guest.
>
>
> 1.2. About hybrid topology and vCPU pinning
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Our ITD virtualization can support most vCPU topologies (except multiple
> packages/dies, see details in 3.5 Restrictions on Guest Topology), and
> can also support the case of non-pinning vCPUs (i.e. it can handle vCPU
> thread migration).
>
> The following is our performance measurement on an i9-13900K machine
> (2995 MHz, 24 cores, 32 threads (8+16), RAM: 14GB (16GB physical)), with
> iGPU passthrough, running 3DMARK in a Win11 Professional Guest:
>
>
> compared with smp topo case     smp topo     smp topo     smp topo     hybrid topo  hybrid topo  hybrid topo  hybrid topo
>                                 + affinity   + ITD        + ITD                     + affinity   + ITD        + ITD
>                                                           + affinity                                          + affinity
> Time Spy - Overall              0.179%       -0.250%      0.179%       -0.107%      0.143%       -0.179%      -0.107%
>   Graphics score                0.124%       -0.249%      0.124%       -0.083%      0.124%       -0.166%      -0.249%
>   CPU score                     0.916%       -0.485%      1.149%       -0.076%      0.722%       -0.324%      11.915%
> Fire Strike Extreme - Overall   0.149%       0.000%       0.224%       -1.021%      -3.361%      -1.319%      -3.361%
>   Graphics score                0.100%       0.050%       0.150%       -1.376%      -3.427%      -1.676%      -3.652%
>   Physics score                 5.060%       0.759%       0.518%       -2.907%      -10.914%     -0.897%      14.638%
>   Combined score                0.120%       -0.179%      0.418%       0.060%       -2.929%      -0.179%      -2.809%
> Fire Strike - Overall           0.350%       -0.085%      0.193%       -1.377%      -1.365%      -1.509%      -1.787%
>   Graphics score                0.256%       -0.047%      0.210%       -1.527%      -1.376%      -1.504%      -2.320%
>   Physics score                 3.695%       -2.180%      0.629%       -1.581%      -6.846%      -1.444%      14.100%
>   Combined score                0.415%       -0.128%      0.128%       -0.957%      -1.052%      -1.594%      -0.957%
> CPU Profile Max Threads         1.836%       0.298%       1.786%       -0.069%      1.545%       0.025%       9.472%
>   16 Threads                    4.290%       0.989%       3.588%       0.595%       1.580%       0.848%       11.295%
>   8 Threads                     -22.632%     -0.602%      -23.167%     -0.988%      -1.345%      -1.340%      8.648%
>   4 Threads                     -21.598%     0.449%       -21.429%     -0.817%      1.951%       -0.832%      2.084%
>   2 Threads                     -12.912%     -0.014%      -12.006%     -0.481%      -0.609%      -0.595%      1.161%
>   1 Thread                      -3.793%      -0.137%      -3.793%      -0.495%      -3.189%      -0.495%      1.154%
>
>
> Based on the above results, exposing only HFI/ITD to Win11 VMs without
> hybrid topology or CPU affinity (case "smp topo + ITD") doesn't hurt
> performance, but doesn't bring any performance improvement either.
>
> With both hybrid topology and CPU affinity set for ITD, Win11 VMs get
> a significant performance improvement (up to 14%+, compared with the
> case of smp topology without CPU affinity).
>
> Beyond the numerical results of 3DMARK, in practice there is also a
> significant improvement in the frame rate of the games.
>
> Also, the more powerful the machine, the more significant the
> performance gains!
>
> Therefore, the best practice for enabling ITD scheduling optimization
> is to set up both CPU affinity and hybrid topology for win11 Guest while
> enabling our ITD virtualization.
>
> Our earlier QEMU prototype RFC [2] presented the initial hybrid
> topology support for VMs. Our other proposal, "QOM topology" [3], has
> since been raised in the QEMU community as the first step towards a
> QOM-based hybrid topology implementation.
>
>
> 2. Introduction of HFI and ITD
> ==============================
>
> Intel provides the Hardware Feedback Interface (HFI) feature to allow
> hardware to provide guidance to the OS scheduler to perform optimal
> workload scheduling through a hardware feedback interface structure in
> memory [4]. This HFI structure is called the HFI table.
>
> For now, the guidance includes performance and energy efficiency
> hints, and it can be updated via thermal interrupt as the actual
> operating conditions of the processor change during run time.
>
> Intel Thread Director (ITD) feature extends the HFI to provide
> performance and energy efficiency data for advanced classes of
> instructions.
>
> Since ITD is an extension of HFI, our ITD virtualization also
> virtualizes the native HFI feature.
>
>
> 3. Dependencies of ITD
> ======================
>
> ITD is a thermal feature that requires:
> * PTM (Package Thermal Management, alias, PTS)
> * HFI (Hardware Feedback Interface)
>
> In order to support the notification mechanism of ITD/HFI dynamic
> update, we also need to add thermal interrupt related support,
> including the following two features:
> * ACPI (Thermal Monitor and Software Controlled Clock Facilities)
> * TM (Thermal Monitor, alias, TM1/ACC)
>
> Therefore, we must also consider support for the emulation of all
> the above dependencies.
>
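The dependency chain above can be sketched as a small CPUID consistency
check. This is only an illustrative Python model, not code from the
series: the ACPI, TM and PTM bit positions are quoted from the patch
titles below, while the HFI (bit 19) and ITD (bit 23) positions in
CPUID.0x06.eax are assumptions taken from the SDM and should be checked
there. The function names are hypothetical.

```python
# ACPI, TM and PTM bits are quoted from the patch titles in this series.
CPUID_01_EDX_ACPI = 1 << 22
CPUID_01_EDX_TM = 1 << 29
CPUID_06_EAX_PTM = 1 << 6
# Assumed from the SDM, not stated in this cover letter.
CPUID_06_EAX_HFI = 1 << 19
CPUID_06_EAX_ITD = 1 << 23


def missing_itd_deps(cpuid_01_edx, cpuid_06_eax):
    """Return the names of ITD prerequisites absent from the guest CPUID."""
    required = [
        ("ACPI", cpuid_01_edx, CPUID_01_EDX_ACPI),
        ("TM", cpuid_01_edx, CPUID_01_EDX_TM),
        ("PTM", cpuid_06_eax, CPUID_06_EAX_PTM),
        ("HFI", cpuid_06_eax, CPUID_06_EAX_HFI),
    ]
    return [name for name, reg, bit in required if not reg & bit]


def itd_config_valid(cpuid_01_edx, cpuid_06_eax):
    """ITD may only be advertised when all of its dependencies are."""
    if not cpuid_06_eax & CPUID_06_EAX_ITD:
        return True  # ITD not requested, nothing to check
    return not missing_itd_deps(cpuid_01_edx, cpuid_06_eax)
```

A VMM-side check of this kind would reject a guest CPUID that sets the
ITD bit without ACPI, TM, PTM and HFI.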
>
> 3.1. ACPI emulation
> ^^^^^^^^^^^^^^^^^^^
>
> For ACPI, we can support it by emulating the RDMSR/WRMSR of the
> associated MSRs and adding the ability to inject thermal interrupts.
> But in fact, we don't really inject thermal interrupts into Guest for
> the thermal conditions corresponding to ACPI. Here the thermal
> interrupt support is prepared for the subsequent HFI/ITD.
>
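Emulating WRMSR for thermal MSRs mostly comes down to honouring per-bit
write semantics (the shortlog below mentions helpers for R/O and R/WC0
bits). A minimal Python sketch of those semantics follows; the function
name and mask parameters are hypothetical, and whether the helpers in
the series are structured this way is an assumption.

```python
def emulate_msr_write(old, new, ro_mask=0, rwc0_mask=0):
    """Return the value stored after a guest WRMSR.

    ro_mask:   bits the guest can never change.
    rwc0_mask: status bits the guest can clear by writing 0 but cannot set.
    All other bits are treated as plain read/write.
    """
    rw_mask = ~(ro_mask | rwc0_mask)
    value = new & rw_mask            # ordinary bits take the new value
    value |= old & ro_mask           # read-only bits keep the old value
    value |= old & new & rwc0_mask   # R/WC0 bits stay set only if written as 1
    return value
```

With this model, a guest write of 0 to a set log bit clears it, while a
write of 1 to a clear log bit leaves it clear.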
>
> 3.2. TM emulation
> ^^^^^^^^^^^^^^^^^
>
> TM is a hardware feature and its CPUID bit only indicates the presence
> of the automatic thermal monitoring facilities. For TM, there's no
> interactive interface between OS and hardware, but its flag is one of
> the prerequisites for the OS to enable thermal interrupt.
>
> Therefore, to support TM, it is enough for us to expose its CPUID flag
> to Guest.
>
>
> 3.3. PTM emulation
> ^^^^^^^^^^^^^^^^^^
>
> PTM is a package-scope feature that includes package-level MSR and
> package-level thermal interrupt. Unfortunately, KVM currently only
> supports thread-scope MSR handling, and also doesn't care about the
> specific Guest's topology.
>
> However, our purpose in supporting PTM in KVM is to further support
> ITD, and the current platforms with ITD all have a single package, so
> we emulate the package-scope MSRs provided by PTM at the VM level.
>
> As a result, the VMM must configure a single-package topology when PTM
> is exposed. To limit the impact of this restriction, we only expose the
> PTM feature bit to Guest when ITD needs to be supported.
>
>
> 3.4. HFI emulation
> ^^^^^^^^^^^^^^^^^^
>
> ITD is an extension of HFI, so both HFI and ITD depend on the HFI
> table. HFI itself is used on the Host for power-related management
> control, so we should only expose HFI to Guest when we need to enable
> ITD.
>
> HFI also relies on PTM interrupt control, so it also has requirements
> for package topology, and we also emulate HFI (including ITD) at the VM
> level.
>
> In addition, because the HFI driver allocates HFI instances per die,
> HFI (and ITD) also require the Guest to be limited to a single die.
>
>
> 3.5. Restrictions on Guest Topology
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Due to KVM's incomplete support for MSR topology and the requirement for
> HFI instance management in the kernel, PTM, HFI, and ITD limit the
> topology of the Guest (mainly restricting the topology types created on
> the VMM side).
>
> Therefore, we only expose PTM, HFI, and ITD to userspace when we need to
> support ITD. At the same time, considering that currently, ITD is only
> used on the client platform with 1 package and 1 die, such temporary
> restrictions will not have too much impact.
>
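The gating rule described in 3.3 to 3.5 can be summarised in a short
Python sketch: PTM, HFI and ITD are only exposed together, when ITD is
wanted and the guest has one package and one die. The function names and
the string feature names are hypothetical and only illustrate the
policy, not the KVM or VMM interfaces.

```python
def may_expose_itd(packages, dies_per_package):
    """PTM/HFI/ITD state is kept per VM, so only 1 package and 1 die fit."""
    return packages == 1 and dies_per_package == 1


def filter_thermal_features(requested, packages, dies_per_package):
    """Drop PTM, HFI and ITD from the requested feature set unless ITD is
    both requested and permitted by the guest topology."""
    gated = {"ptm", "hfi", "itd"}
    if "itd" in requested and may_expose_itd(packages, dies_per_package):
        return set(requested)
    return set(requested) - gated
```

For example, a request for {"tm", "ptm", "hfi", "itd"} on a
two-package guest would be reduced to {"tm"}.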
>
> 4. Overview of ITD (and HFI) virtualization
> ===========================================
>
> The main tasks of ITD (including HFI) virtualization are:
> * maintain a virtual HFI table for VM.
> * inject thermal interrupt when HFI table updates.
> * handle related MSRs' emulation and adjust HFI table based on MSR's
> control bits.
> * expose ITD/HFI configuration info in related CPUID leaves.
>
> The most important of these is the maintenance of the virtual HFI table.
> Although the HFI table should also be per package, since ITD/HFI related
> MSRs are treated as per VM in KVM, we also treat the virtual HFI table
> as per VM.
>
>
> 4.1. HFI table building
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> The HFI table contains a table header and many table entries. Each
> table entry is identified by an HFI table index, and each CPU
> corresponds to one of the HFI table indexes.
>
> The ITD and HFI features both depend on the HFI table, but their HFI
> tables are a little different. The HFI table provided by the ITD
> feature has more classes (i.e. more columns in the table) than the HFI
> table of the native HFI feature.
>
> The virtual HFI table in KVM is built based on the actual HFI table,
> which is maintained by the HFI instance in the HFI driver. We extract
> the HFI data of the pCPUs that the vCPUs are running on to form a
> virtual HFI table.
>
>
> 4.2. HFI table index
> ^^^^^^^^^^^^^^^^^^^^
>
> There are many entries in the HFI table, and each vCPU is assigned an
> HFI table index to specify the entry it maps to. KVM fills the pCPU's
> HFI data (for the pCPU that the vCPU is running on) into the entry
> corresponding to the vCPU's HFI table index in the virtual HFI table.
>
> This index is set by VMM in CPUID.
>
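To make 4.1 and 4.2 concrete, here is a minimal Python model of building
the virtual table. It is a sketch under simplifying assumptions: the
host table is modelled as a dict from pCPU id to a list of per-class
(performance, energy efficiency) pairs, the table header is ignored, and
all names are hypothetical rather than taken from the patches.

```python
def build_virtual_hfi_table(host_table, vcpu_to_pcpu, vcpu_to_index):
    """Build the per-VM virtual HFI table.

    host_table:    pCPU id -> list of (perf, energy_eff) pairs, one per class
    vcpu_to_pcpu:  vCPU id -> pCPU the vCPU is currently running on
    vcpu_to_index: vCPU id -> HFI table index chosen by the VMM
    Returns a dict mapping HFI table index -> copied capability data.
    """
    virtual = {}
    for vcpu, pcpu in vcpu_to_pcpu.items():
        # Copy so later host updates do not silently change the guest view.
        virtual[vcpu_to_index[vcpu]] = list(host_table[pcpu])
    return virtual
```

With ITD enabled, each per-pCPU list would simply carry more classes
than with native HFI; the construction itself stays the same.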
>
> 4.3. HFI table updating
> ^^^^^^^^^^^^^^^^^^^^^^^
>
> On some platforms, the HFI table will be dynamically updated with
> thermal interrupts. In order to update the virtual HFI table in time, we
> added the per-VM notifier to the HFI driver to notify KVM to update the
> virtual HFI table for the VM, and then inject thermal interrupt into the
> VM to notify the Guest.
>
> There is another case that requires updating the virtual HFI table:
> when a vCPU is migrated, the pCPU it runs on changes, and the
> corresponding virtual HFI data should be updated to the new pCPU's
> data. In this case, to reduce overhead, we can update only the data of
> that single vCPU without traversing the entire virtual HFI table.
>
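The two update paths in 4.3 can be modelled as follows. This is a
simplified Python sketch, not the KVM data structures: the class and
method names are hypothetical, a boolean flag stands in for thermal
interrupt injection, and whether the Guest is notified after a vCPU
migration is deliberately left to the caller.

```python
class VirtualHfi:
    """Per-VM virtual HFI state (simplified model)."""

    def __init__(self, host_table, vcpu_to_pcpu, vcpu_to_index):
        self.host_table = host_table
        self.vcpu_to_pcpu = dict(vcpu_to_pcpu)
        self.vcpu_to_index = dict(vcpu_to_index)
        self.entries = {}
        self.irq_pending = False
        for vcpu in self.vcpu_to_pcpu:
            self._refresh(vcpu)

    def _refresh(self, vcpu):
        """Copy the current pCPU's data into the vCPU's entry.
        Returns True if the entry changed."""
        index = self.vcpu_to_index[vcpu]
        data = tuple(self.host_table[self.vcpu_to_pcpu[vcpu]])
        changed = self.entries.get(index) != data
        self.entries[index] = data
        return changed

    def on_host_update(self, host_table):
        """Host HFI table changed: rebuild every entry, then flag a
        thermal interrupt for the Guest."""
        self.host_table = host_table
        for vcpu in self.vcpu_to_pcpu:
            self._refresh(vcpu)
        self.irq_pending = True

    def on_vcpu_migrate(self, vcpu, new_pcpu):
        """vCPU moved to another pCPU: refresh only that vCPU's entry."""
        self.vcpu_to_pcpu[vcpu] = new_pcpu
        return self._refresh(vcpu)
```

The migration path touches a single entry, which is the overhead
reduction described above, while the host-update path walks all vCPUs.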
>
> 5. Patch Summary
> ================
>
> Patch 01-03: Prepare the bit definition, the hfi helpers and hfi data
> structures that KVM needs.
> Patch 04-05: Add the sched_out arch hook and reset the classification
> history at sched_in()/sched_out().
> Patch 06-10: Add emulations of ACPI, TM and PTM, mainly about CPUID and
> related MSRs.
> Patch 11-20: Add the emulation support for HFI, including maintaining
> the HFI table for VM.
> Patch 21-23: Add the emulation support for ITD, including extending HFI
> to ITD and passing through the classification MSRs.
> Patch 24-25: Add HRESET emulation support, which is also used by IPC
> classes feature.
> Patch 26: Add the brief doc about the per-VM lock - pkg_therm_lock.
>
>
> 6. References
> =============
>
> [1]: [PATCH 0/9] thermal: intel: hfi: Prework for the virtualization of HFI
> https://lore.kernel.org/lkml/20240203040515.23947-1-ricardo.neri-calderon@xxxxxxxxxxxxxxx/
> [2]: [RFC 00/52] Introduce hybrid CPU topology,
> https://lore.kernel.org/qemu-devel/20230213095035.158240-1-zhao1.liu@xxxxxxxxxxxxxxx/
> [3]: [RFC 00/41] qom-topo: Abstract Everything about CPU Topology,
> https://lore.kernel.org/qemu-devel/20231130144203.2307629-1-zhao1.liu@xxxxxxxxxxxxxxx/
> [4]: SDM, vol. 3B, section 15.6 HARDWARE FEEDBACK INTERFACE AND INTEL
> THREAD DIRECTOR
>
>
> Thanks and Best Regards,
> Zhao
> ---
> Zhao Liu (17):
> thermal: Add bit definition for x86 thermal related MSRs
> KVM: Add kvm_arch_sched_out() hook
> KVM: x86: Reset hardware history at vCPU's sched_in/out
> KVM: VMX: Add helpers to handle the writes to MSR's R/O and R/WC0 bits
> KVM: x86: cpuid: Define CPUID 0x06.eax by kvm_cpu_cap_mask()
> KVM: VMX: Introduce HFI description structure
> KVM: VMX: Introduce HFI table index for vCPU
> KVM: x86: Introduce the HFI dynamic update request and kvm_x86_ops
> KVM: VMX: Allow to inject thermal interrupt without HFI update
> KVM: VMX: Emulate HFI related bits in package thermal MSRs
> KVM: VMX: Emulate the MSRs of HFI feature
> KVM: x86: Expose HFI feature bit and HFI info in CPUID
> KVM: VMX: Extend HFI table and MSR emulation to support ITD
> KVM: VMX: Pass through ITD classification related MSRs to Guest
> KVM: x86: Expose ITD feature bit and related info in CPUID
> KVM: VMX: Emulate the MSR of HRESET feature
> Documentation: KVM: Add description of pkg_therm_lock
>
> Zhuocheng Ding (9):
> thermal: intel: hfi: Add helpers to build HFI/ITD structures
> thermal: intel: hfi: Add HFI notifier helpers to notify HFI update
> KVM: VMX: Emulate ACPI (CPUID.0x01.edx[bit 22]) feature
> KVM: x86: Expose TM/ACC (CPUID.0x01.edx[bit 29]) feature bit to VM
> KVM: VMX: Emulate PTM/PTS (CPUID.0x06.eax[bit 6]) feature
> KVM: VMX: Support virtual HFI table for VM
> KVM: VMX: Sync update of Host HFI table to Guest
> KVM: VMX: Update HFI table when vCPU migrates
> KVM: x86: Expose HRESET feature's CPUID to Guest
>
> Documentation/virt/kvm/locking.rst | 13 +-
> arch/arm64/include/asm/kvm_host.h | 1 +
> arch/mips/include/asm/kvm_host.h | 1 +
> arch/powerpc/include/asm/kvm_host.h | 1 +
> arch/riscv/include/asm/kvm_host.h | 1 +
> arch/s390/include/asm/kvm_host.h | 1 +
> arch/x86/include/asm/hfi.h | 28 ++
> arch/x86/include/asm/kvm-x86-ops.h | 3 +-
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/include/asm/msr-index.h | 54 +-
> arch/x86/kvm/cpuid.c | 201 +++++++-
> arch/x86/kvm/irq.h | 1 +
> arch/x86/kvm/lapic.c | 9 +
> arch/x86/kvm/svm/svm.c | 8 +
> arch/x86/kvm/vmx/vmx.c | 751 +++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h | 79 ++-
> arch/x86/kvm/x86.c | 18 +
> drivers/thermal/intel/intel_hfi.c | 212 +++++++-
> drivers/thermal/intel/therm_throt.c | 1 -
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 1 +
> 21 files changed, 1343 insertions(+), 44 deletions(-)
>
> --
> 2.34.1
>