Re: [PATCH v5 12/22] x86/virt/tdx: Convert all memory regions in memblock to TDX memory

From: Kai Huang
Date: Mon Jun 27 2022 - 02:16:30 EST


On Fri, 2022-06-24 at 12:40 -0700, Dave Hansen wrote:
> On 6/22/22 04:17, Kai Huang wrote:
> ...
> > Also, explicitly exclude memory regions below first 1MB as TDX memory
> > because those regions may not be reported as convertible memory. This
> > is OK as the first 1MB is always reserved during kernel boot and won't
> > end up to the page allocator.
>
> Are you sure? I wasn't for a few minutes until I found reserve_real_mode()
>
> Could we point to that in this changelog, please?

OK will explicitly point out reserve_real_mode().

>
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index efa830853e98..4988a91d5283 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1974,6 +1974,7 @@ config INTEL_TDX_HOST
> > depends on X86_64
> > depends on KVM_INTEL
> > select ARCH_HAS_CC_PLATFORM
> > + select ARCH_KEEP_MEMBLOCK
> > help
> > Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
> > host and certain physical attacks. This option enables necessary TDX
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 1bc97756bc0d..2b20d4a7a62b 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -15,6 +15,8 @@
> > #include <linux/cpumask.h>
> > #include <linux/smp.h>
> > #include <linux/atomic.h>
> > +#include <linux/sizes.h>
> > +#include <linux/memblock.h>
> > #include <asm/cpufeatures.h>
> > #include <asm/cpufeature.h>
> > #include <asm/msr-index.h>
> > @@ -338,6 +340,91 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *tdsysinfo,
> > return check_cmrs(cmr_array, actual_cmr_num);
> > }
> >
> > +/*
> > + * Skip the memory region below 1MB. Return true if the entire
> > + * region is skipped. Otherwise, the updated range is returned.
> > + */
> > +static bool pfn_range_skip_lowmem(unsigned long *p_start_pfn,
> > + unsigned long *p_end_pfn)
> > +{
> > + u64 start, end;
> > +
> > + start = *p_start_pfn << PAGE_SHIFT;
> > + end = *p_end_pfn << PAGE_SHIFT;
> > +
> > + if (start < SZ_1M)
> > + start = SZ_1M;
> > +
> > + if (start >= end)
> > + return true;
> > +
> > + *p_start_pfn = (start >> PAGE_SHIFT);
> > +
> > + return false;
> > +}
> > +
> > +/*
> > + * Walks over all memblock memory regions that are intended to be
> > + * converted to TDX memory. Essentially, it is all memblock memory
> > + * regions excluding the low memory below 1MB.
> > + *
> > + * This is because on some TDX platforms the low memory below 1MB is
> > + * not included in CMRs. Excluding the low 1MB can still guarantee
> > + * that the pages managed by the page allocator are always TDX memory,
> > + * as the low 1MB is reserved during kernel boot and won't end up to
> > + * the ZONE_DMA (see reserve_real_mode()).
> > + */
> > +#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid) \
> > + for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid) \
> > + if (!pfn_range_skip_lowmem(p_start, p_end))
>
> Let's summarize where we are at this point:
>
> 1. All RAM is described in memblocks
> 2. Some memblocks are reserved and some are free
> 3. The lower 1MB is marked reserved
> 4. for_each_mem_pfn_range() walks all reserved and free memblocks, so we
> have to exclude the lower 1MB as a special case.
>
> That seems superficially rather ridiculous. Shouldn't we just pick a
> memblock iterator that skips the 1MB? Surely there is such a thing.

Perhaps you are suggesting we should always loop the _free_ ranges so we don't
need to care about the first 1MB which is reserved?

The problem is some reserved memory regions are actually later freed to the page
allocator, for example, initrd. So to cover all those 'late-freed-reserved-
regions', I used for_each_mem_pfn_range(), instead of for_each_free_mem_range().

Btw, I do have a checkpatch warning around this code:

ERROR: Macros with complex values should be enclosed in parentheses
#109: FILE: arch/x86/virt/vmx/tdx/tdx.c:377:
+#define memblock_for_each_tdx_mem_pfn_range(i, p_start, p_end, p_nid) \
+ for_each_mem_pfn_range(i, MAX_NUMNODES, p_start, p_end, p_nid) \
+ if (!pfn_range_skip_lowmem(p_start, p_end))

But it looks like a false positive to me.

> Or, should we be doing something different with the 1MB in the memblock
> structure?

memblock APIs are used by other kernel components. I don't think we should
modify memblock code behaviour for TDX. Do you have any specific suggestion?

One possible option I can think is explicitly "register" memory regions as TDX
memory when they are firstly freed to the page allocator. Those regions
includes: 

1) memblock_free_all();

Where majority of the pages are freed to page allocator from memblock.

2) memblock_free_late();

Which covers regions freed to page allocator after 1).

3) free_init_pages();

Which is explicitly used for some reserved areas such as initrd and part of
kernel image.

This will require new data structures to represent TDX memblock and the code to
create, insert and merge contiguous TDX memblocks, etc. The advantage is we can
just iterate those TDX memblocks when constructing TDMRs.


--
Thanks,
-Kai