Re: [PATCH v15 17/23] x86/kexec: Flush cache of TDX private memory

From: Dave Hansen
Date: Mon Nov 27 2023 - 15:06:06 EST


On 11/27/23 11:33, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
>> On 11/9/23 03:55, Kai Huang wrote:
>> ...
>>> --- a/arch/x86/kernel/reboot.c
>>> +++ b/arch/x86/kernel/reboot.c
>>> @@ -31,6 +31,7 @@
>>> #include <asm/realmode.h>
>>> #include <asm/x86_init.h>
>>> #include <asm/efi.h>
>>> +#include <asm/tdx.h>
>>>
>>> /*
>>> * Power off function, if any
>>> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
>>> local_irq_disable();
>>> stop_other_cpus();
>>> #endif
>>> + /*
>>> + * stop_other_cpus() has flushed all dirty cachelines of TDX
>>> + * private memory on remote cpus. Unlike SME, which does the
>>> + * cache flush on _this_ cpu in the relocate_kernel(), flush
>>> + * the cache for _this_ cpu here. This is because on the
>>> + * platforms with "partial write machine check" erratum the
>>> + * kernel needs to convert all TDX private pages back to normal
>>> + * before booting to the new kernel in kexec(), and the cache
>>> + * flush must be done before that. If the kernel took SME's way,
>>> + * it would have to muck with the relocate_kernel() assembly to
>>> + * do memory conversion.
>>> + */
>>> + if (platform_tdx_enabled())
>>> + native_wbinvd();
>>
>> Why can't the TDX host code just set host_mem_enc_active=1?
>>
>> Sure, you'll end up *using* the SME WBINVD support, but then you don't
>> have two different WBINVD call sites. You also don't have to mess with
>> a single line of assembly.
>
> I wanted to avoid changing the assembly.
>
> Perhaps the comment isn't very clear. Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel.
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache. If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.

I honestly think you need to stop thinking about the partial write issue
at this point in the series. It's really causing a horrible amount of
unnecessary confusion.

Here's the golden rule:

The boot CPU needs to run WBINVD sometime after it stops writing
to private memory but before it starts treating the memory as
shared.

On SME kernels, that key point evidently in early boot when it's
enabling SME. I _think_ that point is also a valid place to do WBINVD
on no-TDX-erratum systems.

On TDX systems with the erratum, there's a *second* point before the
private=>shared conversion occurs. I think what you're trying to do
here is prematurely optimize the erratum-affected affected systems so
that they don't do two WBINVDs. Please stop trying to do that.

This WBINVD is only _needed_ for the erratum. It should be closer to
the actual erratum handing.