Re: [PATCH v5 09/19] KVM:x86: Make guest supervisor states as non-XSAVE managed

From: Yang, Weijiang
Date: Thu Aug 10 2023 - 05:30:20 EST


Hi, Dave, Thomas and Peter,

I would like to connect you to this discussion loop about CET supervisor states support
in kernel so that you may directly talk to KVM maintainers, thanks!
The discussion background/problem/solution as below:

Background:
When KVM enumerates shadow stack support for guest in CPUID(0x7, 0).ECX[bit7],
architecturally it claims both SS user and supervisor mode are supported. Although
the latter is not supported in Linux, but in virtualization world, the guest OS could
be non-Linux system, so KVM supervisor state support is necessary in this case.
Two solutions are on the table:
1) Enable CET supervisor support in Linux kernel like user mode support.
2) Enable support in KVM domain.

Problem:
The Pros/Cons for each solution(my individual thoughts):
In kernel solution:
Pros:
- Avoid saving/restoring 3 supervisor MSRs(PL{0,1,2}_SSP) at vCPU execution path.
- Easy for KVM to manage guest CET xstate bits for guest.
Cons:
- Unnecessary supervisor state xsaves/xrstors operation for non-vCPU thread.
- Potentially extra storage space(24 bytes) for thread context.

KVM solution:
Pros:
- Not touch current kernel FPU management framework and logic.
- No extra space and operation for non-vCPU thread.
Cons:
- Manually saving/restoring 3 supervisor MSRs is a performance burden to KVM.
- It looks more like a hack method for KVM, and some handling logic seems a bit awkward.

KVM maintainers request it supported in kernel instead of KVM to make things streamlined.

We'd like to hear your voice of in kernel solution, favor vs. objection?
Any important points we omitted?
Appreciated!

Solution:
Below is the supervisor state enabling patch in kernel, not include Sean's suggestion below.
=====================================================================
From 53f9890c76e4163a0fead3afe198d0c17136120e Mon Sep 17 00:00:00 2001
From: Yang Weijiang <weijiang.yang@xxxxxxxxx>
Date: Thu, 10 Aug 2023 00:10:55 -0400
Subject: [RFC PATCH] x86: fpu: Enable CET supervisor state support

Enable CET supervisor state support within current FPU states management
framework. CET shadow stack feature is enabled with CPUID(0x7,0).ECX[bit7],
if the bit is set, archchtectually both user and supervisor SHSTK should
be supported, i.e., when KVM enumerates the feature bit to guest, it claims
both modes are supported by the VMM.

The user mode SHSTK XSAVE states comprise of IA32_{U_CET,PL3_SSP},
and supervisor mode states inlude IA32_PL{0,1,2}_SSP. The xstate support
for the former is enclosed in native user mode shadow stack series, but
the latter is not supported yet.

KVM is going to support guest shadow stack which means guest's supervisor
shadow stack states should also be well managed by VMM.

To make KVM fully support guest shadow stack states, there're at least two
approaches, one is to enable supervisor xstate bit in kernel, which is
straightforward and can fit well in all cases. The alternative is to enable
the support within KVM domain and manually save/restore the states per vCPU
thread, i.e., introduce addtional WRMSR/RDMSR at vCPU execution path.

This patch doesn't optimize CET supervisor state management, just follow
the implementation of user mode state support.

Signed-off-by: Yang Weijiang <weijiang.yang@xxxxxxxxx>
---
 arch/x86/include/asm/fpu/types.h  | 14 ++++++++++++--
 arch/x86/include/asm/fpu/xstate.h |  6 +++---
 arch/x86/kernel/fpu/xstate.c      |  6 +++++-
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb810074f1e7..c6fd13a17205 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -116,7 +116,7 @@ enum xfeature {
        XFEATURE_PKRU,
        XFEATURE_PASID,
        XFEATURE_CET_USER,
-       XFEATURE_CET_KERNEL_UNUSED,
+       XFEATURE_CET_KERNEL,
        XFEATURE_RSRVD_COMP_13,
        XFEATURE_RSRVD_COMP_14,
        XFEATURE_LBR,
@@ -139,7 +139,7 @@ enum xfeature {
 #define XFEATURE_MASK_PKRU             (1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID            (1 << XFEATURE_PASID)
 #define XFEATURE_MASK_CET_USER         (1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL       (1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL       (1 << XFEATURE_CET_KERNEL)
 #define XFEATURE_MASK_LBR              (1 << XFEATURE_LBR)
 #define XFEATURE_MASK_XTILE_CFG                (1 << XFEATURE_XTILE_CFG)
 #define XFEATURE_MASK_XTILE_DATA       (1 << XFEATURE_XTILE_DATA)
@@ -264,6 +264,16 @@ struct cet_user_state {
        u64 user_ssp;
 };

+/*
+ * State component 12 is Control-flow Enforcement supervisor states
+ */
+struct cet_supervisor_state {
+       /* supervisor ssp pointers  */
+       u64 pl0_ssp;
+       u64 pl1_ssp;
+       u64 pl2_ssp;
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d4427b88ee12..3b4a038d3c57 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -51,7 +51,8 @@

 /* All currently supported supervisor features */
 #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
-                                           XFEATURE_MASK_CET_USER)
+                                           XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL)

 /*
  * A supervisor state component may not always contain valuable information,
@@ -78,8 +79,7 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
- XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)

 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 4fa4751912d9..fc346c7c6916 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,7 +51,7 @@ static const char *xfeature_names[] =
        "Protection Keys User registers",
        "PASID state",
        "Control-flow User registers",
-       "Control-flow Kernel registers (unused)",
+       "Control-flow Kernel registers",
        "unknown xstate feature",
        "unknown xstate feature",
        "unknown xstate feature",
@@ -74,6 +74,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
        [XFEATURE_PKRU]                         = X86_FEATURE_PKU,
        [XFEATURE_PASID]                        = X86_FEATURE_ENQCMD,
        [XFEATURE_CET_USER]                     = X86_FEATURE_SHSTK,
+       [XFEATURE_CET_KERNEL]                   = X86_FEATURE_SHSTK,
        [XFEATURE_XTILE_CFG]                    = X86_FEATURE_AMX_TILE,
        [XFEATURE_XTILE_DATA]                   = X86_FEATURE_AMX_TILE,
 };
@@ -278,6 +279,7 @@ static void __init print_xstate_features(void)
        print_xstate_feature(XFEATURE_MASK_PKRU);
        print_xstate_feature(XFEATURE_MASK_PASID);
        print_xstate_feature(XFEATURE_MASK_CET_USER);
+       print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
        print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
        print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
@@ -347,6 +349,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
         XFEATURE_MASK_BNDCSR |                 \
         XFEATURE_MASK_PASID |                  \
         XFEATURE_MASK_CET_USER |               \
+        XFEATURE_MASK_CET_KERNEL |             \
         XFEATURE_MASK_XTILE)

 /*
@@ -547,6 +550,7 @@ static bool __init check_xstate_against_struct(int nr)
        case XFEATURE_PASID:      return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
        case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
        case XFEATURE_CET_USER:   return XCHECK_SZ(sz, nr, struct cet_user_state);
+       case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
        case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
        default:
                XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
--
2.27.0




On 8/5/2023 4:45 AM, Sean Christopherson wrote:
[...]
Pulling back in the justification from v3:

the Pros:
- Super easy to implement for KVM.
- Automatically avoids saving and restoring this data when the vmexit
is handled within KVM.

the Cons:
- Unnecessarily restores XFEATURE_CET_KERNEL when switching to
non-KVM task's userspace.
- Forces allocating space for this state on all tasks, whether or not
they use KVM, and with likely zero users today and the near future.
- Complicates the FPU optimization thinking by including things that
can have no affect on userspace in the FPU

IMO the pros far outweigh the cons. 3x RDMSR and 3x WRMSR when loading host/guest
state is non-trivial overhead. That can be mitigated, e.g. by utilizing the
user return MSR framework, but it's still unpalatable. It's unlikely many guests
will SSS in the *near* future, but I don't want to end up with code that performs
poorly in the future and needs to be rewritten.

Especially because another big negative is that not utilizing XSTATE bleeds into
KVM's ABI. Userspace has to be told to manually save+restore MSRs instead of just
letting KVM_{G,S}ET_XSAVE handle the state. And that will create a bit of a
snafu if Linux does gain support for SSS.

On the other hand, the extra per-task memory is all of 24 bytes. AFAICT, there's
literally zero effect on guest XSTATE allocations because those are vmalloc'd and
thus rounded up to PAGE_SIZE, i.e. the next 4KiB. And XSTATE needs to be 64-byte
aligned, so the 24 bytes is only actually meaningful if the current size is within
24 bytes of the next cahce line. And the "current" size is variable depending on
which features are present and enabled, i.e. it's a roll of the dice as to whether
or not using XSTATE for supervisor CET would actually increase memory usage. And
_if_ it does increase memory consumption, I have a very hard time believing an
extra 64 bytes in the worst case scenario is a dealbreaker.

If the performance is a concern, i.e. we don't want to eat saving/restoring the
MSRs when switching to/from host FPU context, then I *think* that's simply a matter
of keeping guest state resident when loading non-guest FPU state.

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1015af1ae562..8e7599e3b923 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -167,6 +167,16 @@ void restore_fpregs_from_fpstate(struct fpstate *fpstate, u64 mask)
*/
xfd_update_state(fpstate);
+ /*
+ * Leave supervisor CET state as-is when loading host state
+ * (kernel or userspace). Supervisor CET state is managed via
+ * XSTATE for KVM guests, but the host never consumes said
+ * state (doesn't support supervisor shadow stacks), i.e. it's
+ * safe to keep guest state loaded into hardware.
+ */
+ if (!fpstate->is_guest)
+ mask &= ~XFEATURE_MASK_CET_KERNEL;
+
/*
* Restoring state always needs to modify all features
* which are in @mask even if the current task cannot use


So unless I'm missing something, NAK to this approach, at least not without trying
the kernel FPU approach, i.e. I want somelike like to PeterZ or tglx to actually
full on NAK the kernel approach before we consider shoving a hack into KVM.