Re: [PATCH v2 1/7] mm,memory_hotplug: Allocate memmap from the added memory range

From: David Hildenbrand
Date: Thu Feb 25 2021 - 14:01:25 EST


On 09.02.21 14:38, Oscar Salvador wrote:
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
a) existing memory is consumed for that purpose
(e.g., ~2MB per 128MB memory section on x86_64)
b) if the whole node is movable then we have off-node struct pages,
which have performance drawbacks.
c) it might be that there are no PMD_ALIGNED chunks, so the memmap array
gets populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation-wise, we will reuse the vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on the memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This comes in handy as, in {online,offline}_pages, all the isolation and
migration is done on the (buddy_start_pfn, end_pfn] range,
with buddy_start_pfn = start_pfn + nr_vmemmap_pages.

In this way, we have:

(start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
(buddy_start_pfn, end_pfn] = Initialized and sent to buddy

nit: shouldn't it be

[start_pfn, buddy_start_pfn - 1]
[buddy_start_pfn, end_pfn - 1]

or

[start_pfn, buddy_start_pfn)
[buddy_start_pfn, end_pfn)

(I remember that "[" means inclusive and "(" means exclusive, I might be wrong :) )

I actually prefer the first variant.
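
(Side note for others following the series: IIUC, the altmap-based override described above essentially boils down to pointing a vmem_altmap at the start of the hot-added range, roughly as sketched below. The names mhp_altmap, mhp_flags, MHP_MEMMAP_ON_MEMORY and params are only stand-ins for whatever this series ends up using.)

	/*
	 * Sketch only: carve the memmap out of the hot-added range itself
	 * instead of going through alloc_pages_node().
	 */
	struct vmem_altmap mhp_altmap = {
		.base_pfn = PHYS_PFN(start),	/* memmap pages come from here */
		.free	  = PHYS_PFN(size),	/* at most the whole range */
	};

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY)
		params.altmap = &mhp_altmap;

__populate_section_memmap() would then allocate the struct pages from that altmap instead of from the page allocator.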


Hot-remove:

We need to be careful when removing memory, as adding and
removing memory needs to be done with the same granularity.
To check that this assumption is not violated, we check the memory range we
want to remove, and if a) any memory block has vmemmap pages and b) the range
spans more than a single memory block, we scream out loud and refuse to
proceed.

If all is good and the range was using memmap on memory (aka vmemmap pages),
we construct an altmap structure so free_hugepage_table does the right
thing and calls vmem_altmap_free instead of free_pagetable.

Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>

[...]

/*
* online_page_callback contains pointer to current page onlining function.
* Initially it is generic_online_page(). If it is required it could be
@@ -638,10 +640,20 @@ void generic_online_page(struct page *page, unsigned int order)
}
EXPORT_SYMBOL_GPL(generic_online_page);
-static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)
+static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
+ unsigned long buddy_start_pfn)
{
const unsigned long end_pfn = start_pfn + nr_pages;
- unsigned long pfn;
+ unsigned long pfn = buddy_start_pfn;
+
+ /*
+ * When using memmap_on_memory, the range might be unaligned as the
+ * first pfns are used for vmemmap pages. Align it in case we need to.
+ */
+ if (pfn & ((1 << (MAX_ORDER - 1)) - 1)) {

if (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES))

+ (*online_page_callback)(pfn_to_page(pfn), pageblock_order);
+ pfn += 1 << pageblock_order;

pfn += pageblock_nr_pages;

Can you add a comment explaining why we can be sure that we are off by a single pageblock? What about s390x, where MAX_ORDER_NR_PAGES == 4 * pageblock_nr_pages?

Would it make things simpler to just do

while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES)) {
(*online_page_callback)(pfn_to_page(pfn), 0);
pfn++;
}
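
IOW, keep handing out order-0 pages until the first MAX_ORDER - 1 boundary and let the existing loop do the rest. Roughly (just a sketch on top of the current online_pages_range(), not tested):

static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
			       unsigned long buddy_start_pfn)
{
	const unsigned long end_pfn = start_pfn + nr_pages;
	unsigned long pfn = buddy_start_pfn;

	/*
	 * With memmap_on_memory, the first pfns back the vmemmap, so
	 * buddy_start_pfn might not be MAX_ORDER - 1 aligned. Hand out
	 * single pages until it is, independent of how pageblock_order
	 * and MAX_ORDER relate on the architecture.
	 */
	while (!IS_ALIGNED(pfn, MAX_ORDER_NR_PAGES) && pfn < end_pfn) {
		(*online_page_callback)(pfn_to_page(pfn), 0);
		pfn++;
	}

	/* Online the rest in MAX_ORDER - 1 chunks, as we do today. */
	while (pfn < end_pfn) {
		(*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
		pfn += MAX_ORDER_NR_PAGES;
	}

	/* Mark all involved sections as online. */
	online_mem_sections(start_pfn, end_pfn);
}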

[...]


/*
* Freshly onlined pages aren't shuffled (e.g., all pages are placed to
@@ -1064,6 +1085,12 @@ static int online_memory_block(struct memory_block *mem, void *arg)
return device_online(&mem->dev);
}
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+ return memmap_on_memory_enabled &&
+ size == memory_block_size_bytes();

Regarding my other comments in reply to the other patches, I'd move all the magic you currently have when trying to enable the feature right here.
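
IOW, something in this direction (only a sketch of the kind of checks I have in mind; the exact conditions are whatever the other patches currently verify at enable time):

bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_pages = size >> PAGE_SHIFT;
	unsigned long vmemmap_size = nr_pages * sizeof(struct page);

	/*
	 * Sketch: a single memory block, a vmemmap spanning complete PMDs,
	 * and a remainder covering complete pageblocks.
	 */
	return memmap_on_memory_enabled &&
	       size == memory_block_size_bytes() &&
	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
	       IS_ALIGNED(size - vmemmap_size,
			  pageblock_nr_pages << PAGE_SHIFT);
}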


-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+ unsigned long nr_vmemmap_pages)
{
const unsigned long end_pfn = start_pfn + nr_pages;
- unsigned long pfn, system_ram_pages = 0;
+ unsigned long pfn, buddy_start_pfn, buddy_nr_pages, system_ram_pages = 0;
unsigned long flags;
struct zone *zone;
struct memory_notify arg;
@@ -1578,6 +1620,9 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
+ buddy_start_pfn = start_pfn + nr_vmemmap_pages;
+ buddy_nr_pages = nr_pages - nr_vmemmap_pages;
+
mem_hotplug_begin();
/*
@@ -1613,7 +1658,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
zone_pcp_disable(zone);
/* set above range as isolated */
- ret = start_isolate_page_range(start_pfn, end_pfn,
+ ret = start_isolate_page_range(buddy_start_pfn, end_pfn,
MIGRATE_MOVABLE,
MEMORY_OFFLINE | REPORT_FAILURE);

Did you take care to properly adjust undo_isolate_page_range() as well? I can't spot it.
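
If it's still missing, the failure paths would presumably need the matching adjustment as well, i.e. something like

	undo_isolate_page_range(buddy_start_pfn, end_pfn, MIGRATE_MOVABLE);

so we don't try to undo isolation on the vmemmap pageblocks that were never isolated in the first place.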


if (ret) {
@@ -1633,7 +1678,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
}
do {
- pfn = start_pfn;
+ pfn = buddy_start_pfn;
do {
if (signal_pending(current)) {
ret = -EINTR;
@@ -1664,18 +1709,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
* offlining actually in order to make hugetlbfs's object
* counting consistent.
*/
- ret = dissolve_free_huge_pages(start_pfn, end_pfn);
+ ret = dissolve_free_huge_pages(buddy_start_pfn, end_pfn);
if (ret) {
reason = "failure to dissolve huge pages";
goto failed_removal_isolated;
}
- ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
+ ret = test_pages_isolated(buddy_start_pfn, end_pfn, MEMORY_OFFLINE);
} while (ret);
/* Mark all sections offline and remove free pages from the buddy. */
- __offline_isolated_pages(start_pfn, end_pfn);
+ __offline_isolated_pages(start_pfn, end_pfn, buddy_start_pfn);

pr_debug("Offlined Pages %ld\n", nr_pages);
/*
@@ -1684,13 +1729,13 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
* of isolated pageblocks, memory onlining will properly revert this.
*/
spin_lock_irqsave(&zone->lock, flags);
- zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
+ zone->nr_isolate_pageblock -= buddy_nr_pages / pageblock_nr_pages;
spin_unlock_irqrestore(&zone->lock, flags);
zone_pcp_enable(zone);
/* removal success */
- adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+ adjust_managed_page_count(pfn_to_page(start_pfn), -buddy_nr_pages);
zone->present_pages -= nr_pages;
pgdat_resize_lock(zone->zone_pgdat, &flags);
@@ -1750,6 +1795,14 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
return 0;
}

[...]

+static int get_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
+{
+ unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
+
+ *nr_vmemmap_pages += mem->nr_vmemmap_pages;
+ return mem->nr_vmemmap_pages;
+}
+

I think you can do this more easily: all you want to know is whether there
is any block that has nr_vmemmap_pages set - and return that value.

static int first_set_nr_vmemmap_pages_cb(struct memory_block *mem, void *arg)
{
/* If not set, continue with the next block. */
return mem->nr_vmemmap_pages;
}

...
+ walk_memory_blocks(start, size, &nr_vmemmap_pages,
+ get_nr_vmemmap_pages_cb);

...

mem->nr_vmemmap_pages = walk_memory_blocks(start ...)
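
On the removal side, that could then boil down to something like (sketch only; the surrounding try_remove_memory() plumbing and the mhp_altmap/altmap variables used to build the removal altmap are assumed):

	unsigned long nr_vmemmap_pages;

	/* A non-zero callback return value stops the walk and is returned. */
	nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
					      first_set_nr_vmemmap_pages_cb);
	if (nr_vmemmap_pages) {
		/* Add and remove granularity have to match, see changelog. */
		if (size != memory_block_size_bytes())
			return -EINVAL;
		mhp_altmap.alloc = nr_vmemmap_pages;
		altmap = &mhp_altmap;
	}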



Looks quite promising, only a couple of things to fine-tune :) Thanks!

--
Thanks,

David / dhildenb