Re: [PATCH RFC] x86/xen: attempt to inflate the memory balloon on PVH

From: Stefano Stabellini
Date: Wed Mar 06 2024 - 20:51:12 EST


On Tue, 5 Mar 2024, Roger Pau Monné wrote:
> On Thu, Feb 22, 2024 at 05:16:09PM -0800, Stefano Stabellini wrote:
> > On Tue, 20 Feb 2024, Roger Pau Monne wrote:
> > > When running as PVH or HVM Linux will use holes in the memory map as scratch
> > > space to map grants, foreign domain pages and possibly miscellaneous other
> > > stuff. However the usage of such memory map holes for Xen purposes can be
> > > problematic. The request of holesby Xen happen quite early in the kernel boot
> > > process (grant table setup already uses scratch map space), and it's possible
> > > that by then not all devices have reclaimed their MMIO space. It's not
> > > unlikely for chunks of Xen scratch map space to end up using PCI bridge MMIO
> > > window memory, which (as expected) causes quite a lot of issues in the system.
> >
> > Am I understanding correctly that XEN_BALLOON_MEMORY_HOTPLUG doesn't
> > help because it becomes available too late in the PVH boot sequence?
>
> No, not really, the hoptplug mechanism is available as early as the
> balloon driver requires, the issue is that when Linux starts making
> use of such unpopulated ranges (for example in order to map the shared
> info page) many drivers have not yet reserved their MMIO regions, and so it's
> not uncommon for the balloon driver to end up using address ranges that
> would otherwise be used by device BARs for example.
>
> This causes havoc, Linux starts to reposition device BARs, sometimes
> it can manage to re-position them, otherwise some devices are not
> usable.

OK this is bad


> > > At least for PVH dom0 we have the possibility of using regions marked as
> > > UNUSABLE in the e820 memory map. Either if the region is UNUSABLE in the
> > > native memory map, or it has been converted into UNUSABLE in order to hide RAM
> > > regions from dom0, the second stage translation page-tables can populate those
> > > areas without issues.
> > >
> > > PV already has this kind of logic, where the balloon driver is inflated at
> > > boot. Re-use the current logic in order to also inflate it when running as
> > > PVH. onvert UNUSABLE regions up to the ratio specified in EXTRA_MEM_RATIO to
> > > RAM, while reserving them using xen_add_extra_mem() (which is also moved so
> > > it's no longer tied to CONFIG_PV).
> > >
> > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> > > ---
> > > RFC reasons:
> > >
> > > * Note that it would be preferred for the hypervisor to provide an explicit
> > > range to be used as scratch mapping space, but that requires changes to Xen,
> > > and it's not fully clear whether Xen can figure out the position of all MMIO
> > > regions at boot in order to suggest a scratch mapping region for dom0.
> > >
> > > * Should the whole set of xen_{add,del,chk,inv}_extra_mem() functions be moved
> > > to a different file? For the purposes of PVH only xen_add_extra_mem() is
> > > moved and the chk and inv ones are PV specific and might not want moving to
> > > a separate file just to guard them with CONFIG_PV.
> > > ---
> > > arch/x86/include/asm/xen/hypervisor.h | 1 +
> > > arch/x86/platform/pvh/enlighten.c | 3 ++
> > > arch/x86/xen/enlighten.c | 32 +++++++++++++
> > > arch/x86/xen/enlighten_pvh.c | 68 +++++++++++++++++++++++++++
> > > arch/x86/xen/setup.c | 44 -----------------
> > > arch/x86/xen/xen-ops.h | 14 ++++++
> > > drivers/xen/balloon.c | 2 -
> > > 7 files changed, 118 insertions(+), 46 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
> > > index a9088250770f..31e2bf8d5db7 100644
> > > --- a/arch/x86/include/asm/xen/hypervisor.h
> > > +++ b/arch/x86/include/asm/xen/hypervisor.h
> > > @@ -62,6 +62,7 @@ void xen_arch_unregister_cpu(int num);
> > > #ifdef CONFIG_PVH
> > > void __init xen_pvh_init(struct boot_params *boot_params);
> > > void __init mem_map_via_hcall(struct boot_params *boot_params_p);
> > > +void __init xen_reserve_extra_memory(struct boot_params *bootp);
> > > #endif
> > >
> > > /* Lazy mode for batching updates / context switch */
> > > diff --git a/arch/x86/platform/pvh/enlighten.c b/arch/x86/platform/pvh/enlighten.c
> > > index 00a92cb2c814..a12117f3d4de 100644
> > > --- a/arch/x86/platform/pvh/enlighten.c
> > > +++ b/arch/x86/platform/pvh/enlighten.c
> > > @@ -74,6 +74,9 @@ static void __init init_pvh_bootparams(bool xen_guest)
> > > } else
> > > xen_raw_printk("Warning: Can fit ISA range into e820\n");
> > >
> > > + if (xen_guest)
> > > + xen_reserve_extra_memory(&pvh_bootparams);
> > > +
> > > pvh_bootparams.hdr.cmd_line_ptr =
> > > pvh_start_info.cmdline_paddr;
> > >
> > > diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> > > index 3c61bb98c10e..a01ca255b0c6 100644
> > > --- a/arch/x86/xen/enlighten.c
> > > +++ b/arch/x86/xen/enlighten.c
> > > @@ -6,6 +6,7 @@
> > > #include <linux/console.h>
> > > #include <linux/cpu.h>
> > > #include <linux/kexec.h>
> > > +#include <linux/memblock.h>
> > > #include <linux/slab.h>
> > > #include <linux/panic_notifier.h>
> > >
> > > @@ -350,3 +351,34 @@ void xen_arch_unregister_cpu(int num)
> > > }
> > > EXPORT_SYMBOL(xen_arch_unregister_cpu);
> > > #endif
> > > +
> > > +/* Amount of extra memory space we add to the e820 ranges */
> > > +struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS] __initdata;
> > > +
> > > +void __init xen_add_extra_mem(unsigned long start_pfn, unsigned long n_pfns)
> > > +{
> > > + unsigned int i;
> > > +
> > > + /*
> > > + * No need to check for zero size, should happen rarely and will only
> > > + * write a new entry regarded to be unused due to zero size.
> > > + */
> > > + for (i = 0; i < XEN_EXTRA_MEM_MAX_REGIONS; i++) {
> > > + /* Add new region. */
> > > + if (xen_extra_mem[i].n_pfns == 0) {
> > > + xen_extra_mem[i].start_pfn = start_pfn;
> > > + xen_extra_mem[i].n_pfns = n_pfns;
> > > + break;
> > > + }
> > > + /* Append to existing region. */
> > > + if (xen_extra_mem[i].start_pfn + xen_extra_mem[i].n_pfns ==
> > > + start_pfn) {
> > > + xen_extra_mem[i].n_pfns += n_pfns;
> > > + break;
> > > + }
> > > + }
> > > + if (i == XEN_EXTRA_MEM_MAX_REGIONS)
> > > + printk(KERN_WARNING "Warning: not enough extra memory regions\n");
> > > +
> > > + memblock_reserve(PFN_PHYS(start_pfn), PFN_PHYS(n_pfns));
> > > +}
> > > diff --git a/arch/x86/xen/enlighten_pvh.c b/arch/x86/xen/enlighten_pvh.c
> > > index ada3868c02c2..c28f073c1df5 100644
> > > --- a/arch/x86/xen/enlighten_pvh.c
> > > +++ b/arch/x86/xen/enlighten_pvh.c
> > > @@ -1,6 +1,7 @@
> > > // SPDX-License-Identifier: GPL-2.0
> > > #include <linux/acpi.h>
> > > #include <linux/export.h>
> > > +#include <linux/mm.h>
> > >
> > > #include <xen/hvc-console.h>
> > >
> > > @@ -72,3 +73,70 @@ void __init mem_map_via_hcall(struct boot_params *boot_params_p)
> > > }
> > > boot_params_p->e820_entries = memmap.nr_entries;
> > > }
> > > +
> > > +/*
> > > + * Reserve e820 UNUSABLE regions to inflate the memory balloon.
> > > + *
> > > + * On PVH dom0 the host memory map is used, RAM regions available to dom0 are
> > > + * located as the same place as in the native memory map, but since dom0 gets
> > > + * less memory than the total amount of host RAM the ranges that can't be
> > > + * populated are converted from RAM -> UNUSABLE. Use such regions (up to the
> > > + * ratio signaled in EXTRA_MEM_RATIO) in order to inflate the balloon driver at
> > > + * boot. Doing so prevents the guest (even if just temporary) from using holes
> > > + * in the memory map in order to map grants or foreign addresses, and
> > > + * hopefully limits the risk of a clash with a device MMIO region. Ideally the
> > > + * hypervisor should notify us which memory ranges are suitable for creating
> > > + * foreign mappings, but that's not yet implemented.
> > > + */
> > > +void __init xen_reserve_extra_memory(struct boot_params *bootp)
> > > +{
> > > + unsigned int i, ram_pages = 0, extra_pages;
> > > +
> > > + for (i = 0; i < bootp->e820_entries; i++) {
> > > + struct boot_e820_entry *e = &bootp->e820_table[i];
> > > +
> > > + if (e->type != E820_TYPE_RAM)
> > > + continue;
> > > + ram_pages += PFN_DOWN(e->addr + e->size) - PFN_UP(e->addr);
> > > + }
> > > +
> > > + /* Max amount of extra memory. */
> > > + extra_pages = EXTRA_MEM_RATIO * ram_pages;
> > > +
> > > + /*
> > > + * Convert UNUSABLE ranges to RAM and reserve them for foreign mapping
> > > + * purposes.
> > > + */
> > > + for (i = 0; i < bootp->e820_entries && extra_pages; i++) {
> > > + struct boot_e820_entry *e = &bootp->e820_table[i];
> > > + unsigned long pages;
> > > +
> > > + if (e->type != E820_TYPE_UNUSABLE)
> > > + continue;
> > > +
> > > + pages = min(extra_pages,
> > > + PFN_DOWN(e->addr + e->size) - PFN_UP(e->addr));
> > > +
> > > + if (pages != (PFN_DOWN(e->addr + e->size) - PFN_UP(e->addr))) {
> > > + struct boot_e820_entry *next;
> > > +
> > > + if (bootp->e820_entries ==
> > > + ARRAY_SIZE(bootp->e820_table))
> > > + /* No space left to split - skip region. */
> > > + continue;
> > > +
> > > + /* Split entry. */
> > > + next = e + 1;
> > > + memmove(next, e,
> > > + (bootp->e820_entries - i) * sizeof(*e));
> > > + bootp->e820_entries++;
> > > + next->addr = PAGE_ALIGN(e->addr) + PFN_PHYS(pages);
> > > + e->size = next->addr - e->addr;
> > > + next->size -= e->size;
> >
> > Is this really worth doing? Can we just skip this range and continue or
> > simply break out and call it a day? Or even add the whole range instead?
> >
> > The reason I am asking is that I am expecting E820_TYPE_UNUSABLE regions
> > not to be huge. Splitting one just to cover the few remaining pages out
> > of extra_pages doesn't seem worth it?
>
> No, they are usually quite huge on PVH dom0, because when building a
> PVH dom0 the E820_TYPE_RAM ranges that are not made available to dom0
> because of a dom0_mem option end up being reported as
> E820_TYPE_UNUSABLE in the e820 provided to dom0.
>
> That's mostly the motivation of the change, to be able to reuse those
> ranges as scratch space for foreign mappings.
>
> Ideally the hypervisor should somehow report suitable ranges in the
> address space for domains to create foreign mappings, but this does
> require an amount of extra work I don't have time to do ATM, hence
> this stopgap proposal.

I see. We have gained this feature on ARM not long ago for Dom0 and
Dom0less guests.

All right, I have no reservations. The patch looks OK to me. Juergen?