Re: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups

From: Oscar Salvador
Date: Tue Jun 08 2021 - 05:42:57 EST


On Mon, Jun 07, 2021 at 09:54:18PM +0200, David Hildenbrand wrote:
> Hi,
>
> this series aims at improving in-kernel auto-online support. It tackles the
> fundamental problems that:

Hi David,

the idea sounds good to me, and I like that this series takes away part of the
responsability from the user to know where the memory should go.
I think the kernel is a much better fit for that as it has all the required
information to balance things.

I also glanced over the series and besides some things here and there the
whole approach looks sane.
I plan to have a look into it in a few days, just have some high level questions
for the time being:

> 1) We can create zone imbalances when onlining all memory blindly to
> ZONE_MOVABLE, in the worst case crashing the system. We have to know
> upfront how much memory we are going to hotplug such that we can
> safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
> via "online_movable". This is far from practical and only applicable in
> limited setups -- like inside VMs under the RHV/oVirt hypervisor which
> will never hotplug more than 3 times the boot memory (and the
> limitation is only in place due to the Linux limitation).

Could you give more insight about the problems created by zone imbalances (e.g:
a lot of movable memory and little kernel memory).

> 2) We see more setups that implement dynamic VM resizing, hot(un)plugging
> memory to resize VM memory. In these setups, we might hotplug a lot of
> memory, but it might happen in various small steps in both directions
> (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
> primary driver of this upstream right now, performing such dynamic
> resizing NUMA-aware via multiple virtio-mem devices.
>
> Onlining all hotplugged memory to ZONE_NORMAL means we basically have
> no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
> easily run into zone imbalances when growing a VM. We want a mixture,
> and we want as much memory as reasonable/configured in ZONE_MOVABLE.
>
> 3) Memory devices consist of 1..X memory block devices, however, the
> kernel doesn't really track the relationship. Consequently, also user
> space has no idea. We want to make per-device decisions. As one
> example, for memory hotunplug it doesn't make sense to use a mixture of
> zones within a single DIMM: we want all MOVABLE if possible, otherwise
> all !MOVABLE, because any !MOVABLE part will easily block the DIMM from
> getting hotunplugged. As another example, virtio-mem operates on
> individual units that span 1..X memory blocks. Similar to a DIMM, we
> want a unit to either be all MOVABLE or !MOVABLE. Further, we want
> as much memory of a virtio-mem device to be MOVABLE as possible.

So, a virtio-mem unit could be seen as DIMM right?

> 4) We want memory onlining to be done right from the kernel while adding
> memory; for example, this is reqired for fast memory hotplug for
> drivers that add individual memory blocks, like virito-mem. We want a
> way to configure a policy in the kernel and avoid implementing advanced
> policies in user space.

"we want memory onlining to be done right from the kernel while adding memory"

is not that always the case when a driver adds memory? User has no interaction
with that right?

> The auto-onlining support we have in the kernel is not sufficient. All we
> have is a) online everything movable (online_movable) b) online everything
> !movable (online_kernel) c) keep zones contiguous (online). This series
> allows configuring c) to mean instead "online movable if possible according
> to the coniguration, driven by a maximum MOVABLE:KERNEL ratio" -- a new
> onlining policy.
>
> This series does 3 things:
>
> 1) Introduces the "auto-movable" online policy that initially operates on
> individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
> to make a decision whether a memory block will be onlined to
> ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
> memory does not allow for more MOVABLE memory (details in the
> patches). CMA memory is treated like MOVABLE memory.

How a user would know which ratio is sane? Could we add some info in the
Docu part that kinda sets some "basic" rules?

> 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
> groups and uses group information to make decisions in the
> "auto-movable" online policy accross memory blocks of a single memory
> device (modeled as memory group).

So, the distinction being that a DIMM cannot grow larger but we can add more
memory to a virtio-mem unit? I feel I am missing some insight here.

> 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
> allowing ZONE_NORMAL memory within a dynamic memory group to allow for
> more ZONE_MOVABLE memory within the same memory group. The target use
> case is dynamic VM resizing using virtio-mem.

Sorry, I got lost in this one. Care to explain a bit more?

> The target usage will be:
>
> 1) Linux boots with "mhp_default_online_type=offline"
>
> 2) User space (e.g., systemd unit) configures memory onlining (according
> to a config file and system properties), for example:
> * Setting memory_hotplug.online_policy=auto-movable
> * Setting memory_hotplug.auto_movable_ratio=301
> * Setting memory_hotplug.auto_movable_numa_aware=true

I think we would need to document those in order to let the user know what
it is best for them. e.g: when do we want to enable auto_movable_numa_aware etc.

> For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
> 301% results in the following layout:
> Memory block 1-15: DMA32 (early)
> Memory block 32-47: Normal (early)
> Memory block 48-79: Movable (DIMM 0)
> Memory block 80-111: Movable (DIMM 1)
> Memory block 112-143: Movable (DIMM 2)
> Memory block 144-275: Normal (DIMM 3)
> Memory block 176-207: Normal (DIMM 4)
> ... all Normal
> (-> hotplugged Normal memory does not allow for more Movable memory)

Uhm, I am sorry for being dense here:

On x86_64, 4GB = 32 sections (of 128MB each). Why the memblock span from #1 to #47?

> For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
> will result in the following layout:
> Memory block 1-15: DMA32 (early)
> Memory block 32-47: Normal (early)
> Memory block 48-143: Movable (virtio-mem, first 12 GiB)
> Memory block 144: Normal (virtio-mem, next 128 MiB)
> Memory block 145-147: Movable (virtio-mem, next 384 MiB)
> Memory block 148: Normal (virtio-mem, next 128 MiB)
> Memory block 149-151: Movable (virtio-mem, next 384 MiB)
> ... Normal/Movable mixture as above
> (-> hotplugged Normal memory allows for more Movable memory within
> the same device)
>
> Which gives us maximum flexibility when dynamically growing/shrinking a
> VM in smaller steps. When shrinking, virtio-mem will prioritize unplug of
> MOVABLE memory with [1] sent last week, such that we won't accidentially
> trigger zone imbalances in more complicated setups that involve multiple
> virtio-mem devices.

--
Oscar Salvador
SUSE L3