Re: [PATCH 3/4] x86/kexec(): Reset TDX private memory on platforms with TDX erratum

From: Dave Hansen
Date: Wed Jan 31 2024 - 16:24:40 EST


On 1/31/24 03:31, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -28,6 +28,7 @@
> #include <asm/setup.h>
> #include <asm/set_memory.h>
> #include <asm/cpu.h>
> +#include <asm/tdx.h>
>
> #ifdef CONFIG_ACPI
> /*
> @@ -298,9 +299,24 @@ void machine_kexec(struct kimage *image)
> void *control_page;
> int save_ftrace_enabled;
>
> + /*
> + * For platforms with TDX "partial write machine check" erratum,
> + * all TDX private pages need to be converted back to normal
> + * before booting to the new kernel, otherwise the new kernel
> + * may get unexpected machine check.
> + *
> + * But skip this when preserve_context is on. The second kernel
> + * shouldn't write to the first kernel's memory anyway. Skipping
> + * this also avoids killing TDX in the first kernel, which would
> + * require more complicated handling.
> + */

This is waaaaaaaaaaaaaaay too much information to stick in a generic
function. Just call tdx_reset_memory() and make the argument about

> #ifdef CONFIG_KEXEC_JUMP
> if (image->preserve_context)
> save_processor_state();
> + else
> + tdx_reset_memory();
> +#else
> + tdx_reset_memory();
> #endif

Wow, that's awfully hard to read. I really wish folks' gag reflex would
kick in when they see stuff like this to get them to spend an additional
15 seconds to turn this into:

if (IS_ENABLED(CONFIG_KEXEC_JUMP) && image->preserve_context)
save_processor_state();
else
tdx_reset_memory();

> save_ftrace_enabled = __ftrace_enabled_save();
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 9f1fed458a32..0537b1b76c2b 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -28,6 +28,7 @@
> #include <linux/acpi.h>
> #include <linux/suspend.h>
> #include <linux/acpi.h>
> +#include <linux/reboot.h>
> #include <asm/page.h>
> #include <asm/special_insns.h>
> #include <asm/msr-index.h>
> @@ -54,6 +55,8 @@ static DEFINE_MUTEX(tdx_module_lock);
> /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */
> static LIST_HEAD(tdx_memlist);
>
> +static bool tdx_rebooting;
> +
> typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
>
> static inline void seamcall_err(u64 fn, u64 err, struct tdx_module_args *args)
> @@ -1187,6 +1190,9 @@ static int __tdx_enable(void)
> {
> int ret;
>
> + if (tdx_rebooting)
> + return -EAGAIN;

This whole 'tdx_rebooting' deserves to at least be in its own patch.
Also -EAGAIN seems to be a really odd return code for a permanent failure.

> ret = init_tdx_module();
> if (ret) {
> pr_err("module initialization failed (%d)\n", ret);
> @@ -1420,6 +1426,90 @@ static struct notifier_block tdx_memory_nb = {
> .notifier_call = tdx_memory_notifier,
> };
>
> +/*
> + * Convert TDX private pages back to normal on platforms with
> + * "partial write machine check" erratum.
> + *
> + * Called from machine_kexec() before booting to the new kernel.
> + */
> +void tdx_reset_memory(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_TDX_HOST_PLATFORM))
> + return;
> +
> + /*
> + * Kernel read/write to TDX private memory doesn't
> + * cause machine check on hardware w/o this erratum.
> + */
> + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> + return;
> +
> + /* Called from kexec() when only rebooting cpu is alive */
> + WARN_ON_ONCE(num_online_cpus() != 1);
> +
> + /*
> + * tdx_reboot_notifier() waits until ongoing TDX module
> + * initialization to finish, and module initialization is
> + * rejected after that. Therefore @tdx_module_status is
> + * stable here and can be read w/o holding lock.
> + */
> + if (tdx_module_status != TDX_MODULE_INITIALIZED)
> + return;

kexec() can't happen until after reboot notifiers are called.
tdx_reboot_notifier() makes 'tdx_module_status' stable, so no lock is
needed.

Right?

> + /*
> + * Flush cache of all TDX private memory _before_ converting
> + * them back to avoid silent memory corruption.
> + */
> + native_wbinvd();

Since this is single-threaded, it also needs to mention where all the
other CPU caches got flushed.

> + /*
> + * Convert PAMTs back to normal. All other cpus are already
> + * dead and TDMRs/PAMTs are stable.
> + *
> + * Ideally it's better to cover all types of TDX private pages
> + * here, but it's impractical:
> + *
> + * - There's no existing infrastructure to tell whether a page
> + * is TDX private memory or not.
> + *
> + * - Using SEAMCALL to query TDX module isn't feasible either:
> + * - VMX has been turned off by reaching here so SEAMCALL
> + * cannot be made;
> + * - Even SEAMCALL can be made the result from TDX module may

^ if ^ a ^,

> + * not be accurate (e.g., remote CPU can be stopped while
> + * the kernel is in the middle of reclaiming TDX private
> + * page and doing MOVDIR64B).
> + *
> + * One temporary solution could be just converting all memory
> + * pages, but it's problematic too, because not all pages are
> + * mapped as writable in direct mapping. It can be done by
> + * switching to the identical mapping for kexec() or a new page
> + * table which maps all pages as writable, but the complexity is
> + * overkill.
> + *
> + * Thus instead of doing something dramatic to convert all pages,
> + * only convert PAMTs here. Other kernel components which use
> + * TDX need to do the conversion on their own by intercepting the
> + * rebooting/shutdown notifier (KVM already does that).
> + */

I'd leave the extended alternatives discussion in the changelog, not here.

Focus on what _this_ is doing and why it is imperfect:

1. Just reset the PAMDs
2. This leaves A, B, and C undealt with
3. The risk of leaving those is ...


> + tdmrs_reset_pamt_all(&tdx_tdmr_list);
> +}
> +
> +static int tdx_reboot_notifier(struct notifier_block *nb, unsigned long mode,
> + void *unused)
> +{
> + /* Wait ongoing TDX initialization to finish */
> + mutex_lock(&tdx_module_lock);
> + tdx_rebooting = true;
> + mutex_unlock(&tdx_module_lock);
> +
> + return NOTIFY_OK;
> +}

Why is 'tdx_rebooting' a new variable instead of a new state in
tdx_module_status?