Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()

From: Huang, Kai
Date: Mon Mar 25 2024 - 11:36:02 EST

Next message: Jason Gunthorpe: "Re: [PATCH] mm/arch: Provide pud_pfn() fallback"
Previous message: Andrew Lunn: "Re: [EXTERNAL] Re: [PATCH v2 0/5] Fix prestera driver fail to probe twice"
Next in thread: kirill.shutemov@xxxxxxxxxxxxxxx: "Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 2024-03-22 at 09:50 -0500, Tom Lendacky wrote:
> On 3/22/24 05:40, Kirill A. Shutemov wrote:
> > On Thu, Mar 21, 2024 at 04:02:11PM -0500, Tom Lendacky wrote:
> > > On 3/20/24 18:10, Kirill A. Shutemov wrote:
> > > > On Thu, Mar 21, 2024 at 09:48:28AM +1300, Huang, Kai wrote:
> > > > >
> > > > > > > Hi Tom,
> > > > > > >
> > > > > > > I am not aware of kexec() support status for SEV-ES/SEV-SNP guests.
> > > > > > > Does patch 1 break them?
> > > > > >
> > > > > > SNP guests can kexec with some patches that are currently in process
> > > > > > around shared to private memory conversions. ES guests can only kexec
> > > > > > with a single vCPU. There was a recent patch series to add support for
> > > > > > multiple vCPUs.
> > > > > >
> > > > > > Patch #1 doesn't break either ES or SNP because we still have an IDT and
> > > > > > traditional kernel addressing in place, so the #VC can be handled.
> > > > >
> > > > > How about plain SEV guest?
> > > > >
> > > > > >
> > > > > > Whereas patch #2 has switched to identity mapping and removed the IDT,
> > > > > > so a #VC causes a triple fault.
> > > > >
> > > > > That makes sense. Thanks.
> > > > >
> > > > > Hi Kirill,
> > > > >
> > > > > Does TDX guest have similar behaviour -- that WBINVD in stop_this_cpu() can
> > > > > be handled although it causes #VE, while WBINVD in relocate_kernel() will
> > > > > just triple fault the guest?
> > > >
> > > > No. We never handle WBINVD #VE. Guest cannot handle WBINVD itself and the
> > > > only option is to ask host to do this. We cannot guarantee host will do
> > >
> > > Is the WBINVD performed or ignored in that case?
> >
> > We crash the guest if it tries to use WBINVD. There's no legitimate reason
> > for it.
> >
> > > > anything useful with the request. I guess it can be potential attack
> > > > vector if host strategically ignores WBINVD to induce bad guest behaviour.
> > >
> > > With SNP, memory is coherent so there isn't a need for a WBINVD within a
> > > guest and so issuing it should not be an issue whether the hypervisor
> > > performs the operation or not. I don't know what can happen in the case
> > > where, say, you have a non-coherent TDISP device attached or such, but that
> > > would be very unusual/unlikely.
> >
> > Looks like SNP is in the same position as TDX.
> >
> > > > And it is not good from host PoV either. If it does WBINVD on every guest
> > > > request we get guest->host DoS attack possibility.
> > >
> > > Yeah, that can happen today, regardless of the type of VM running.
> > >
> > > >
> > > > Tom, I am curious, how do you deal with these problems?
> > >
> > > If the WBINVD is being intercepted, then it will generate a #VC and we use
> > > the GHCB protocol to communicate that back to the hypervisor to handle.
> >
> > I would argue that forwarding it to hypervisor is worse than crashing. It
> > gives false sense of doing something. Hypervisor is outside TCB and
> > considered hostile.
>
> Since the memory is coherent, it really doesn't matter what the hypervisor
> does in regards to WBINVD (ignore it or perform it). And the hypervisor
> can do anything it wants on any exit, regardless of this intercept.
>

I guess it makes sense to not handle #VE due to WBINVD in the sense that guest
shouldn't do WBINVD when memory is coherent from guest's view, although it is
harmless to make the WBINVD and let hypervisor handle it.

Anyway, the current TDX guest doesn't handle #VE due to WBINVD, so I think for
simplicity we just don't do WBINVD in stop_this_cpu() and relocate_kernel() for
both TDX and SNP/SEV-ES guests.

As mentioned in my earlier reply, we can achieve this by skipping WBINVD when
the CC_ATTR_GUEST_MEM_ENCRYPT is true:

if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
native_wbinvd();

(This skips WBINVD for plain SEV guest too, but this exactly is the current
behaviour of the upstream code, so I don't see any problem.)

Alternatively, we can have a dedicated CPU feature flag such as
X86_FEATURE_NO_WBINVD,

if (!boot_cpu_has(X86_FEATURE_NO_WBINVD))
native_wbinvd();

Or, we can just change to our mindset to "do unconditional WBINVD, but not in
virtualized environment":

if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
native_wbinvd();

Hi Boris/Dave/Tom/Kirill,

Do you have any comments?

Next message: Jason Gunthorpe: "Re: [PATCH] mm/arch: Provide pud_pfn() fallback"
Previous message: Andrew Lunn: "Re: [EXTERNAL] Re: [PATCH v2 0/5] Fix prestera driver fail to probe twice"
Next in thread: kirill.shutemov@xxxxxxxxxxxxxxx: "Re: [PATCH v2 2/5] x86/kexec: do unconditional WBINVD in relocate_kernel()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]