Re: [PATCH 0/4] mm,memory_hotplug: allocate memmap from hotadded memory

From: David Hildenbrand
Date: Thu Mar 28 2019 - 11:31:50 EST


On 28.03.19 16:09, David Hildenbrand wrote:
> On 28.03.19 14:43, Oscar Salvador wrote:
>> Hi,
>>
>> since the last two RFCs went almost unnoticed (thanks David for the feedback),
>> I decided to re-work some parts to make it simpler, give it more
>> testing, and drop the RFC tag, to see if it gets more attention.
>> I also added David's feedback, so now all users of add_memory/__add_memory/
>> add_memory_resource can specify whether they want to use this feature or not.
>
> Terrific, I will also definitely try to make use of that in the next
> virtio-mem prototype (looks like I'll finally have time to look into it
> again).
>
>> I also fixed some compilation issues when CONFIG_SPARSEMEM_VMEMMAP is not set.
>>
>> [Testing]
>>
>> Testing has been carried out on the following platforms:
>>
>> - x86_64 (small and big memblocks)
>> - powerpc
>> - arm64 (Huawei's fellows)
>>
>> I plan to test it on Xen and Hyper-V, but for now those two will not be
>> using this feature, and neither will DAX/pmem.
>
> I think doing it step by step is the right approach. Less likely to
> break stuff.
>
>>
>> Of course, if this does not find any strong objection, my next step is to
>> work on enabling this on Xen/Hyper-V.
>>
>> [Coverletter]
>>
>> This is another step to make the memory hotplug more usable. The primary
>> goal of this patchset is to reduce memory overhead of the hot added
>> memory (at least for SPARSE_VMEMMAP memory model). The way we currently
>> populate memmap (struct page array) has two main drawbacks:
>>
>> a) it consumes additional memory until the hotadded memory itself is
>> onlined and
>> b) memmap might end up on a different NUMA node, which is especially true
>> for movable_node configuration.
>>
>> a) is a problem especially for memory hotplug based memory "ballooning"
>> solutions when the delay between physical memory hotplug and the
>> onlining can lead to OOM and that led to introduction of hacks like auto
>> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
>> policy for the newly added memory")).
>>
>> b) can have performance drawbacks.
>>
>> I have also seen hot-add operations failing on archs because they
>> were running out of order-x pages.
>> E.g. on powerpc, in certain configurations, we use order-8 pages,
>> and given a 64KB base page size, that is 16MB.
>> If we run out of those, we just fail the operation and we cannot add
>> more memory.
>> We could fall back to base pages as x86_64 does, but we can do better.
>>
>> One way to mitigate all these issues is to simply allocate memmap array
>> (which is the largest memory footprint of the physical memory hotplug)
>> from the hotadded memory itself. VMEMMAP memory model allows us to map
>> any pfn range so the memory doesn't need to be online to be usable
>> for the array. See patch 3 for more details. In short I am reusing an
>> existing vmem_altmap which wants to achieve the same thing for nvdimm
>> device memory.
>>
>> There is also one potential drawback, though. If somebody uses memory
>> hotplug for 1G (gigantic) hugetlb pages then this scheme will not work
>> for them obviously because each memory block will contain reserved
>> area. Large x86 machines will use 2G memblocks so at least one 1G page
>> will be available but this is still not 2G...
>>
>> If that is a problem, we can always configure a fallback strategy to
>> use the current scheme.
>>
>> Since this only works when CONFIG_VMEMMAP_ENABLED is set,
>> we do check for it before setting the flag that allows us
>> to use the feature, no matter if the user wanted it.
>>
>> [Overall design]:
>>
>> Let us say we hot-add 2GB of memory on a x86_64 (memblock size = 128M).
>> That is:
>>
>> - 16 sections
>> - 524288 pages
>> - 8192 vmemmap pages (out of those 524288. We spend 512 pages for each section)
>>
>> The range of pages is: 0xffffea0004000000 - 0xffffea0006000000
>> The vmemmap range is: 0xffffea0004000000 - 0xffffea0004080000
>>
>> 0xffffea0004000000 is the head vmemmap page (first page), while all the others
>> are "tails".
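The numbers in the example above can be re-derived with a small sketch. It assumes 4 KiB base pages, 128 MiB sections and a 64-byte struct page, which are the usual x86_64 values but are not stated in the cover letter; the struct and function names are made up for illustration and are not part of the patchset.

```c
#include <stdint.h>

/* Assumed x86_64 values, not taken from the patch itself. */
#define PAGE_SIZE_BYTES     4096ULL          /* 4 KiB base page */
#define SECTION_SIZE_BYTES  (128ULL << 20)   /* 128 MiB section */
#define STRUCT_PAGE_BYTES   64ULL            /* sizeof(struct page) */

/* Hypothetical container for the figures quoted in the example. */
struct memmap_cost {
	uint64_t sections;
	uint64_t pages;
	uint64_t vmemmap_pages;
	uint64_t vmemmap_pages_per_section;
};

/* Compute how many pages are needed to hold the memmap of a hot-added range. */
static struct memmap_cost memmap_cost_for(uint64_t hotadd_bytes)
{
	struct memmap_cost c;

	c.sections = hotadd_bytes / SECTION_SIZE_BYTES;
	c.pages = hotadd_bytes / PAGE_SIZE_BYTES;
	/* Bytes of struct page needed, rounded up to whole pages. */
	c.vmemmap_pages = (c.pages * STRUCT_PAGE_BYTES + PAGE_SIZE_BYTES - 1)
			  / PAGE_SIZE_BYTES;
	c.vmemmap_pages_per_section =
		c.sections ? c.vmemmap_pages / c.sections : 0;
	return c;
}

/* End address of the struct page entries for nr_pages pages starting at base. */
static uint64_t memmap_end(uint64_t base, uint64_t nr_pages)
{
	return base + nr_pages * STRUCT_PAGE_BYTES;
}
```

Calling memmap_cost_for(2ULL << 30) gives the 16 sections, 524288 pages and 8192 vmemmap pages quoted above, and memmap_end() applied to 0xffffea0004000000 with 524288 and 8192 pages gives the two end addresses of the ranges.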
>>
>> We keep the following information in it:
>>
>> - Head page:
>>   - head->_refcount: number of sections
>>   - head->private  : number of vmemmap pages
>> - Tail page:
>>   - tail->freelist : pointer to the head
>>
>> This is done because it eases the work in cases where we have to compute the
>> number of vmemmap pages to know how many we have to skip etc., and to keep
>> the right accounting for present_pages.
>>
>> When we want to hot-remove the range, we need to be careful because the first
>> pages of that range are used for the memmap mapping, so if we remove those
>> first, we would blow up while accessing the others later on.
>> For that reason we keep the number of sections in head->_refcount, to know how
>> long we have to defer freeing them.
>>
>> Since in a hot-remove operation, sections are being removed sequentially, the
>> approach taken here is that every time we hit free_section_memmap(), we decrease
>> the refcount of the head.
>> When it reaches 0, we know that we hit the last section, so we call
>> vmemmap_free() for the whole memory range backwards, so we make sure that
>> the pages used for the mapping will be the last to be freed up.
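The deferred free described above can be modelled in user space as follows. This is only a sketch of the idea, not the patch's code: struct vmemmap_head, section_memmap_put() and teardown_order() are hypothetical names, and plain counters stand in for head->_refcount and head->private.

```c
#include <stddef.h>

/* Hypothetical stand-in for the bookkeeping kept in the head vmemmap page. */
struct vmemmap_head {
	unsigned long nr_sections;      /* plays the role of head->_refcount */
	unsigned long nr_vmemmap_pages; /* plays the role of head->private */
};

/*
 * Called once per section being removed. Returns 1 when the last section
 * has been dropped and the memmap of the whole range may be freed,
 * 0 while the free still has to be deferred.
 */
static int section_memmap_put(struct vmemmap_head *head)
{
	if (head->nr_sections == 0)
		return 0; /* nothing left to drop */
	head->nr_sections--;
	return head->nr_sections == 0;
}

/*
 * Record the order in which the sections' memmap would be torn down once
 * the count hits zero: highest section first, so section 0, which holds
 * the vmemmap pages themselves, goes last. Returns the entries written.
 */
static size_t teardown_order(size_t nr_sections, size_t *order, size_t cap)
{
	size_t n = 0;

	for (size_t i = nr_sections; i > 0 && n < cap; i--)
		order[n++] = i - 1;
	return n;
}
```

With the 2GB example (16 sections), the first 15 calls to section_memmap_put() return 0 and only the 16th returns 1; teardown_order() then lists sections 15 down to 0, which keeps the pages backing the memmap alive until every other section has been handled.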
>>
>> Vmemmap pages are charged to spanned/present_pages, but not to managed_pages.
>>
>
> I guess one important thing to mention is that it is no longer possible
> to remove memory in a different granularity than it was added. I vaguely
> remember that ACPI code sometimes "reuses" parts of already added
> memory. We would have to validate that this can indeed not be an issue.
>
> drivers/acpi/acpi_memhotplug.c:
>
> 	result = __add_memory(node, info->start_addr, info->length);
> 	if (result && result != -EEXIST)
> 		continue;
>
> What would happen when removing this dimm (->remove_memory())?
>
>
> Also have a look at
>
> arch/powerpc/platforms/powernv/memtrace.c
>
> I consider it evil code. It will simply try to offline+unplug *some*
> memory it finds in *some granularity*. Not sure if this might be
> problematic.
>
> Would there be any "safety net" for adding/removing memory in different
> granularities?
>

Correct me if I am wrong. I think I was confused - vmemmap data is still
allocated *per memory block*, not for the whole added memory, correct?

--

Thanks,

David / dhildenb