Re: [PATCH v3 00/21] Enable CET Virtualization

From: Yang, Weijiang
Date: Mon Jul 17 2023 - 03:45:06 EST



On 6/24/2023 4:51 AM, Sean Christopherson wrote:
On Mon, Jun 19, 2023, Weijiang Yang wrote:
On 6/17/2023 1:56 AM, Sean Christopherson wrote:
On Fri, Jun 16, 2023, Weijiang Yang wrote:
On 6/16/2023 7:30 AM, Sean Christopherson wrote:
On Thu, May 11, 2023, Yang Weijiang wrote:

[...]
Let me make it clear: you want me to do two things:

1) Add Supervisor Shadow Stack state support (i.e., XSS.bit12 (CET_S)) into
the kernel so that the host can support guest Supervisor Shadow Stack MSRs in
guest/host FPU context switch.
If that's necessary for correct functionality, yes.

Hi, Sean,

I held off posting the new version because I want to sync up with you on this
point first, to avoid surprising you.

After discussing adding the patch to the kernel with Rick and Chao, we reached
the conclusions below on doing so:

the Pros:
 - Super easy to implement for KVM.
 - Automatically avoids saving and restoring this data when the vmexit
   is handled within KVM.

the Cons:
 - Unnecessarily restores XFEATURE_CET_KERNEL when switching to a
   non-KVM task's userspace.
 - Forces allocating space for this state on all tasks, whether or not
   they use KVM, with likely zero users today and in the near future.
 - Complicates the FPU optimization thinking by including, in the FPU
   state, things that can have no effect on userspace.
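
For reference, the rejected kernel-side change would boil down to roughly the
following (just a sketch against the current xstate definitions; a real change
would also need the xstate init/size/offset plumbing for the new component):

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ ... @@
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+                                            XFEATURE_MASK_CET_KERNEL)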

Given the above reasons, I implemented guest CET supervisor state management
in KVM instead of adding a kernel patch for it.

Below are the three KVM patches that support it:

Patch 1: Save/reload guest CET supervisor states when necessary:

=======================================================================

commit 16147ede75dee29583b7d42a6621d10d55b63595
Author: Yang Weijiang <weijiang.yang@xxxxxxxxx>
Date:   Tue Jul 11 02:26:17 2023 -0400

    KVM: x86: Make guest supervisor states non-XSAVE-managed

    Save and reload guest CET supervisor states, i.e., PL{0,1,2}_SSP,
    when the vCPU context is swapped before and after userspace<->kernel
    transitions, and do the same when the vCPU is scheduled in or out.

    Enable CET supervisor state management only in KVM because:
    1) Currently, supervisor SHSTK is not enabled on the host side; only
    KVM needs to care about the guest's supervisor SHSTK states.
    2) Enabling it in the kernel FPU state framework has global effects
    on all threads on the host kernel, yet the majority of threads never
    use CET supervisor states, and it requires additional storage in
    every thread's FPU state area.

    Add a new helper, kvm_arch_sched_out(), for that purpose. Adding
    the support in kvm_arch_vcpu_put()/load() without the new helper
    looks possible, but those functions are also called from
    vcpu_put()/vcpu_load(), which are heavily used in KVM, so adding a
    new helper makes the implementation clearer.

    Signed-off-by: Yang Weijiang <weijiang.yang@xxxxxxxxx>

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7e7e19ef6993..98235cb3d258 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1023,6 +1023,7 @@ void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);

 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 void kvm_arm_init_debug(void);
 void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 957121a495f0..56c5e85ba5a3 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -893,6 +893,7 @@ static inline void kvm_arch_free_memslot(struct kvm *kvm,
                                         struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..11587d953bf6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -880,6 +880,7 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}

diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
index ee0acccb1d3b..6ff4a04fe0f2 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -244,6 +244,7 @@ struct kvm_vcpu_arch {

 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 #define KVM_ARCH_WANT_MMU_NOTIFIER

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 2bbc3d54959d..d1750a6a86cf 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1033,6 +1033,7 @@ extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);

 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
                                         struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e2c549f147a5..7d9cfb7e2fe8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11212,6 +11212,31 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
        trace_kvm_fpu(0);
 }

+static void kvm_save_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+       preempt_disable();
+       if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+               rdmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+               rdmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+               rdmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+               wrmsrl(MSR_IA32_PL0_SSP, 0);
+               wrmsrl(MSR_IA32_PL1_SSP, 0);
+               wrmsrl(MSR_IA32_PL2_SSP, 0);
+       }
+       preempt_enable();
+}
+
+static void kvm_reload_cet_supervisor_ssp(struct kvm_vcpu *vcpu)
+{
+       preempt_disable();
+       if (unlikely(guest_can_use(vcpu, X86_FEATURE_SHSTK))) {
+               wrmsrl(MSR_IA32_PL0_SSP, vcpu->arch.cet_s_ssp[0]);
+               wrmsrl(MSR_IA32_PL1_SSP, vcpu->arch.cet_s_ssp[1]);
+               wrmsrl(MSR_IA32_PL2_SSP, vcpu->arch.cet_s_ssp[2]);
+       }
+       preempt_enable();
+}
+
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
        struct kvm_queued_exception *ex = &vcpu->arch.exception;
@@ -11222,6 +11247,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
        kvm_sigset_activate(vcpu);
        kvm_run->flags = 0;
        kvm_load_guest_fpu(vcpu);
+       kvm_reload_cet_supervisor_ssp(vcpu);

        kvm_vcpu_srcu_read_lock(vcpu);
        if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11310,6 +11336,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
        r = vcpu_run(vcpu);

 out:
+       kvm_save_cet_supervisor_ssp(vcpu);
        kvm_put_guest_fpu(vcpu);
        if (kvm_run->kvm_valid_regs)
                store_regs(vcpu);
@@ -12398,9 +12425,17 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
                pmu->need_cleanup = true;
                kvm_make_request(KVM_REQ_PMU, vcpu);
        }
+
+       kvm_reload_cet_supervisor_ssp(vcpu);
+
        static_call(kvm_x86_sched_in)(vcpu, cpu);
 }
+
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
+{
+       kvm_save_cet_supervisor_ssp(vcpu);
+}
+
 void kvm_arch_free_vm(struct kvm *kvm)
 {
        kfree(to_kvm_hv(kvm)->hv_pa_pg);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d90331f16db1..b3032a5f0641 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1423,6 +1423,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu);

 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu);

 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66c1447d3c7f..42f28e8905e1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5885,6 +5885,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
        struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

+       kvm_arch_sched_out(vcpu, 0);
        if (current->on_rq) {
                WRITE_ONCE(vcpu->preempted, true);
                WRITE_ONCE(vcpu->ready, true);
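
Taken together, Patch 1 keeps the guest PL{0,1,2}_SSP values live in
hardware only while the vCPU can actually run; roughly (call flow, not
verbatim code):

        KVM_RUN ioctl
          kvm_load_guest_fpu()
          kvm_reload_cet_supervisor_ssp()   /* guest values -> MSRs */
            vcpu_run()
              preempted: kvm_sched_out() -> save guest values, zero MSRs
              resumed:   kvm_sched_in()  -> reload guest values into MSRs
          kvm_save_cet_supervisor_ssp()     /* MSRs -> guest values, zero MSRs */
          kvm_put_guest_fpu()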

Patch 2: an optimization on top of the one above:

===================================================================

commit ae5fe7c81cc3b93193758d1b7b4ab74a92a51dad
Author: Yang Weijiang <weijiang.yang@xxxxxxxxx>
Date:   Fri Jul 14 20:03:52 2023 -0400

    KVM: x86: Optimize CET supervisor SSP save/reload

    Make PL{0,1,2}_SSP write-intercepted in order to detect whether
    the guest is using these MSRs, and disable the write intercept once
    they're written with non-zero values. KVM saves/reloads the MSRs
    only if they're actually used by the guest.

    Signed-off-by: Yang Weijiang <weijiang.yang@xxxxxxxxx>

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 69cbc9d9b277..c50b555234fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -748,6 +748,7 @@ struct kvm_vcpu_arch {
        bool tpr_access_reporting;
        bool xsaves_enabled;
        bool xfd_no_write_intercept;
+       bool cet_sss_active;
        u64 ia32_xss;
        u64 microcode_version;
        u64 arch_capabilities;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 90ce1c7d3fd7..21c89d200c88 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2156,6 +2156,18 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
        return debugctl;
 }

+static void vmx_disable_write_intercept_sss_msr(struct kvm_vcpu *vcpu)
+{
+       if (guest_can_use(vcpu, X86_FEATURE_SHSTK)) {
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+                               MSR_TYPE_RW, false);
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+                               MSR_TYPE_RW, false);
+               vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+                               MSR_TYPE_RW, false);
+       }
+}
+
 /*
  * Writes msr value into the appropriate "register".
  * Returns 0 on success, non-0 otherwise.
@@ -2427,7 +2439,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 #define VMX_CET_CONTROL_MASK   (~GENMASK_ULL(9,6))
 #define LEG_BITMAP_BASE(data)  ((data) >> 12)
        case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
-               return kvm_set_msr_common(vcpu, msr_info);
+               if (kvm_set_msr_common(vcpu, msr_info))
+                       return 1;
+               /*
+                * Writes to the base SSP MSRs should happen ahead of
+                * toggling of the IA32_S_CET.SH_STK_EN bit.
+                */
+               if (!msr_info->host_initiated &&
+                   msr_index != MSR_IA32_PL3_SSP && data) {
+                       vmx_disable_write_intercept_sss_msr(vcpu);
+                       wrmsrl(msr_index, data);
+               }
                break;
        case MSR_IA32_U_CET:
        case MSR_IA32_S_CET:
@@ -7773,12 +7795,17 @@ static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
                                MSR_TYPE_RW, false);
                vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
                                MSR_TYPE_RW, false);
+               /*
+                * Supervisor shadow stack MSRs are intercepted until
+                * they're written by the guest; this is designed to
+                * reduce the save/restore overhead.
+                */
                vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
-                               MSR_TYPE_RW, false);
+                               MSR_TYPE_R, false);
                vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
-                               MSR_TYPE_RW, false);
+                               MSR_TYPE_R, false);
                vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
-                               MSR_TYPE_RW, false);
+                               MSR_TYPE_R, false);
                vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cab31dbb2bec..06dc5111da3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4049,8 +4049,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
                if (!IS_ALIGNED(data, 4))
                        return 1;
                if (msr == MSR_IA32_PL0_SSP || msr == MSR_IA32_PL1_SSP ||
-                   msr == MSR_IA32_PL2_SSP)
+                   msr == MSR_IA32_PL2_SSP) {
+                       if (!msr_info->host_initiated && data)
+                               vcpu->arch.cet_sss_active = true;
                        vcpu->arch.cet_s_ssp[msr - MSR_IA32_PL0_SSP] = data;
+               }
                else if (msr == MSR_IA32_PL3_SSP)
                        kvm_set_xsave_msr(msr_info);
                break;
@@ -11250,7 +11253,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
        kvm_sigset_activate(vcpu);
        kvm_run->flags = 0;
        kvm_load_guest_fpu(vcpu);
-       kvm_reload_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_reload_cet_supervisor_ssp(vcpu);

        kvm_vcpu_srcu_read_lock(vcpu);
        if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
@@ -11339,7 +11343,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
        r = vcpu_run(vcpu);

 out:
-       kvm_save_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_save_cet_supervisor_ssp(vcpu);
        kvm_put_guest_fpu(vcpu);
        if (kvm_run->kvm_valid_regs)
                store_regs(vcpu);
@@ -12428,15 +12433,16 @@ void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
                pmu->need_cleanup = true;
                kvm_make_request(KVM_REQ_PMU, vcpu);
        }
-
-       kvm_reload_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_reload_cet_supervisor_ssp(vcpu);

        static_call(kvm_x86_sched_in)(vcpu, cpu);
 }

 void kvm_arch_sched_out(struct kvm_vcpu *vcpu, int cpu)
 {
-       kvm_save_cet_supervisor_ssp(vcpu);
+       if (vcpu->arch.cet_sss_active)
+               kvm_save_cet_supervisor_ssp(vcpu);
 }

 void kvm_arch_free_vm(struct kvm *kvm)

=============================================================
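
Note the ordering assumption Patch 2 bakes in (also called out in the
vmx_set_msr() comment above): the guest is expected to program the
supervisor SSP MSRs before setting IA32_S_CET.SH_STK_EN, so the first
non-zero write is still intercepted and KVM can set cet_sss_active and
drop the write intercepts in time. From the guest's side, roughly (a
sketch; constants as defined in msr-index.h):

        /* Program the supervisor SSPs first: still write-intercepted,
         * so KVM latches cet_sss_active and passes the MSRs through. */
        wrmsrl(MSR_IA32_PL0_SSP, ssp);
        /* Only then enable supervisor shadow stacks. */
        wrmsrl(MSR_IA32_S_CET, CET_SHSTK_EN);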

Patch 3: support for the guest CET supervisor xstate bit:

commit 2708b3c959db56fb9243f9a157884c2120b8810c
Author: Yang Weijiang <weijiang.yang@xxxxxxxxx>
Date:   Sat Jul 15 20:56:37 2023 -0400

    KVM: x86: Enable guest CET supervisor xstate bit support

    Add the S_CET bit to kvm_caps.supported_xss so that the guest can
    enumerate the feature in CPUID(0xd,1).ECX.

    The guest S_CET xstate bit is handled specially, i.e., it can be
    exposed without the corresponding enabling on the host side, because
    KVM manually saves/reloads the guest supervisor SHSTK SSPs and the
    current host/guest XSS swap logic also supports doing so; thus it is
    safe to enable the bit without host support.

    Signed-off-by: Yang Weijiang <weijiang.yang@xxxxxxxxx>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2653e5eb54ee..071bcdedc530 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -228,7 +228,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
                                | XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
                                | XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)

-#define KVM_SUPPORTED_XSS      (XFEATURE_MASK_CET_USER)
+#define KVM_SUPPORTED_XSS      (XFEATURE_MASK_CET_USER | \
+                                XFEATURE_MASK_CET_KERNEL)

 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
@@ -9639,6 +9640,7 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
        if (boot_cpu_has(X86_FEATURE_XSAVES)) {
                rdmsrl(MSR_IA32_XSS, host_xss);
                kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+               kvm_caps.supported_xss |= XFEATURE_MASK_CET_KERNEL;
        }

        kvm_init_pmu_capability(ops->pmu_ops);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f8f042c91728..df187d7c3e74 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -362,7 +362,7 @@ static inline bool kvm_mpx_supported(void)
                == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
 }

-#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER)
+#define CET_XSTATE_MASK (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)
 /*
  * Shadow Stack and Indirect Branch Tracking feature enabling depends on
  * whether host side CET user xstate bit is supported or not.

=================================================================
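
For completeness, the guest-visible effect of Patch 3 can be checked from
guest user space with a trivial CPUID probe (a sketch; CPUID(0xd,1).ECX
bit 12 is the CET_S/IA32_XSS.CET_KERNEL bit):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t eax, ebx, ecx, edx;

        /* CPUID leaf 0xD, sub-leaf 1: ECX/EDX enumerate IA32_XSS bits. */
        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(0xd), "c"(1));
        printf("CET_S (XSS bit 12): %s\n",
               (ecx & (1u << 12)) ? "enumerated" : "not enumerated");
        return 0;
}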

What are your thoughts on this solution? Is it appropriate for KVM?

Thanks!

[...]