Re: [PATCH v2 2/2] mm/memory_hotplug: Reset node's state when empty during offline

From: David Hildenbrand
Date: Tue Jun 21 2022 - 03:59:19 EST


On 21.06.22 06:17, Oscar Salvador wrote:
> All possible nodes are now pre-allocated at boot time by free_area_init()->
> free_area_init_node(), and those which are to be hot-plugged are initialized
> later on by hotadd_init_pgdat()->free_area_init_core_hotplug() when they
> become online.
>
> free_area_init_core_hotplug() calls pgdat_init_internals() and
> zone_init_internals() to initialize some internal data structures
> and zeroes a few pgdat fields.
>
> But we do already call pgdat_init_internals() and zone_init_internals()
> for all possible nodes back in free_area_init_core(), and pgdat fields
> are already zeroed because the pre-allocation memsets with 0s the
> structure, meaning we do not need to repeat the process when
> the node becomes online.
>
> So initialize it only once when booting, and make sure to reset
> the fields we care about to 0 when the node goes empty.
> The only thing we need to check for is to allocate per_cpu_nodestats
> struct the very first time this node goes online.
>
> node_reset_state() is the function in charge of resetting pgdat's fields,
> and it is called when offline_pages() detects that the node becomes empty
> worth of memory.
>
> Signed-off-by: Oscar Salvador <osalvador@xxxxxxx>
> ---
> include/linux/memory_hotplug.h | 2 +-
> mm/memory_hotplug.c | 54 ++++++++++++++++++++--------------
> mm/page_alloc.c | 49 +++++-------------------------
> 3 files changed, 41 insertions(+), 64 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 20d7edf62a6a..917112661b5c 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -319,7 +319,7 @@ extern void set_zone_contiguous(struct zone *zone);
> extern void clear_zone_contiguous(struct zone *zone);
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> -extern void __ref free_area_init_core_hotplug(struct pglist_data *pgdat);
> +extern bool pgdat_has_boot_nodestats(pg_data_t *pgdat);
> extern int __add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> extern int add_memory(int nid, u64 start, u64 size, mhp_t mhp_flags);
> extern int add_memory_resource(int nid, struct resource *resource,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 1213d0c67a53..8a464cdd44ad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1176,18 +1176,18 @@ static void reset_node_present_pages(pg_data_t *pgdat)
> /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> static pg_data_t __ref *hotadd_init_pgdat(int nid)
> {
> - struct pglist_data *pgdat;
> + struct pglist_data *pgdat = NODE_DATA(nid);
>
> /*
> - * NODE_DATA is preallocated (free_area_init) but its internal
> - * state is not allocated completely. Add missing pieces.
> - * Completely offline nodes stay around and they just need
> - * reintialization.
> + * NODE_DATA is preallocated (free_area_init), the only thing missing
> + * is to allocate its per_cpu_nodestats struct and to build node's
> + * zonelists. The allocation of per_cpu_nodestats only needs to be done
> + * the very first time this node is brought up, as we reset its state
> + * when all node's memory goes offline.
> */
> - pgdat = NODE_DATA(nid);
> -
> - /* init node's zones as empty zones, we don't have any present pages.*/
> - free_area_init_core_hotplug(pgdat);
> + if (pgdat_has_boot_nodestats(pgdat))
> + pgdat->per_cpu_nodestats = alloc_percpu_gfp(struct per_cpu_nodestat,
> + __GFP_ZERO);
>
> /*
> * The node we allocated has no zone fallback lists. For avoiding
> @@ -1195,15 +1195,6 @@ static pg_data_t __ref *hotadd_init_pgdat(int nid)
> */
> build_all_zonelists(pgdat);
>
> - /*
> - * When memory is hot-added, all the memory is in offline state. So
> - * clear all zones' present_pages because they will be updated in
> - * online_pages() and offline_pages().
> - * TODO: should be in free_area_init_core_hotplug?
> - */
> - reset_node_managed_pages(pgdat);
> - reset_node_present_pages(pgdat);
> -
> return pgdat;
> }
>
> @@ -1780,6 +1771,26 @@ static void node_states_clear_node(int node, struct memory_notify *arg)
> node_clear_state(node, N_MEMORY);
> }
>
> +static void node_reset_state(int node)
> +{
> + pg_data_t *pgdat = NODE_DATA(node);
> + int cpu;
> +
> + kswapd_stop(node);
> + kcompactd_stop(node);
> +
> + pgdat->nr_zones = 0;

^ What is that? It should be called "highest_zone_idx", and I don't see
any reason why we really need it.

To detect if a node is empty we can use pgdat_is_empty(). To detect if a
zone is empty we can use zone_is_empty().

Using "pgdat->nr_zones" as an optimization is questionable: we only
iterate over a handful of zones, and in practice most nodes lack the
*lower* zones like ZONE_DMA* but do have ZONE_NORMAL.

Can we get rid of that and just check pgdat_is_empty() and
zone_is_empty() and iterate all applicable zones from 0..X?
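
Roughly what I have in mind, as a userspace sketch only. The structs are
simplified stand-ins that carry just the fields used here, not the kernel
definitions; the two *_is_empty() helpers mirror the kernel ones of the
same name, and count_nonempty_zones() is a hypothetical name made up for
illustration:

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel types; only the fields used below. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

struct zone {
	unsigned long spanned_pages;
};

struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
};

/* A zone is empty when it spans no pages. */
static bool zone_is_empty(const struct zone *zone)
{
	return zone->spanned_pages == 0;
}

/* A node is empty when it has neither a start pfn nor spanned pages. */
static bool pgdat_is_empty(const struct pglist_data *pgdat)
{
	return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
}

/*
 * Hypothetical helper: walk every zone from 0..MAX_NR_ZONES-1 and skip
 * the empty ones, without consulting any cached "nr_zones" value.
 */
static int count_nonempty_zones(const struct pglist_data *pgdat)
{
	int i, nr = 0;

	if (pgdat_is_empty(pgdat))
		return 0;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		if (zone_is_empty(&pgdat->node_zones[i]))
			continue;
		nr++;
	}
	return nr;
}
```

With so few zones, the full walk costs next to nothing, and a node that
only has ZONE_NORMAL is handled without any extra bookkeeping.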


If what I'm saying makes sense, that could be done before this patch.

--
Thanks,

David / dhildenb