[PATCH v6 0/8] hugetlb: parallelize hugetlb page init on boot

From: Gang Li
Date: Thu Feb 22 2024 - 09:05:14 EST


Hi all, hugetlb init parallelization has now been updated to v6.

This version is tested on mm/mm-stable.

Since the release of v5, there have been some scattered discussions, which
primarily centered on two issues. Both issues have now been resolved, leading
to the release of v6.

Two updates in v6
-----------------
- Fix a Kconfig warning
hugetlb parallelization depends on PADATA, and PADATA depends on SMP. When SMP
is not selected, selecting PADATA will cause a warning: "WARNING: unmet direct
dependencies detected for PADATA". So HUGETLBFS can only select PADATA when
SMP is set.

padata.c will not be compiled if !SMP, but padata_do_multithreaded is still
used in this series for hugetlb parallel init. So it is necessary to implement
a serial version in padata.h.

- Fix a potential bug in gather_bootmem_prealloc_node
The current padata_do_multithreaded implementation guarantees that each
gather_bootmem_prealloc_node task handles exactly one node. However, the API
described in the padata_do_multithreaded comment allows multiple nodes to be
assigned to a single gather_bootmem_prealloc_node task.

To avoid a potential bug from future changes in padata_do_multithreaded,
gather_bootmem_prealloc_parallel is introduced to wrap
gather_bootmem_prealloc_node.

More details in: https://lore.kernel.org/r/20240213111347.3189206-3-gang.li@xxxxxxxxx
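
The two v6 changes can be illustrated with a small userspace sketch. This is
not the kernel code: `struct mt_job_sketch`, `do_multithreaded_serial`,
`gather_node_sketch` and `gather_parallel_sketch` are hypothetical stand-ins,
and the struct models only the fields this example needs. It assumes the
thread function receives a half-open range [start, end) of node ids.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 8

/* Simplified stand-in for struct padata_mt_job (assumption: reduced fields). */
struct mt_job_sketch {
	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
	void *fn_arg;
	unsigned long start;	/* first unit of work (here: first node id) */
	unsigned long size;	/* number of units of work */
	int max_threads;	/* ignored by the serial fallback */
};

/* Serial fallback: the caller processes the whole range in one call. */
static inline void do_multithreaded_serial(struct mt_job_sketch *job)
{
	assert(job != NULL && job->thread_fn != NULL);
	job->thread_fn(job->start, job->start + job->size, job->fn_arg);
}

/* Counts how often each node was processed, for demonstration only. */
static int node_visits[SKETCH_MAX_NODES];

/* Stand-in for the per-node work: handles exactly one node. */
static void gather_node_sketch(unsigned long nid)
{
	assert(nid < SKETCH_MAX_NODES);
	node_visits[nid]++;
}

/*
 * Range wrapper: correct whether the dispatcher hands over one node
 * or several nodes in a single call.
 */
static void gather_parallel_sketch(unsigned long start, unsigned long end,
				   void *arg)
{
	(void)arg;	/* unused in this sketch */
	for (unsigned long nid = start; nid < end; nid++)
		gather_node_sketch(nid);
}
```

With the wrapper as `thread_fn`, a job covering nodes 0..3 visits each node
exactly once, even though the serial fallback passes the entire range in a
single call.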

Introduction
------------
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
host takes more than 1 minute[1]. This is a noteworthy figure.

Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure such as
padata_do_multithreaded; this patch uses it to achieve effective results
with minimal modifications.

[1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@xxxxxxxxxx/
[3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@xxxxxxxxxxxxx/
[4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@xxxxxxxxxx/

max_threads
-----------
This patch uses `padata_do_multithreaded` like this:

```
job.max_threads = num_node_state(N_MEMORY) * multiplier;
padata_do_multithreaded(&job);
```

To fully utilize the CPU, the number of parallel threads needs to be
carefully considered. `max_threads = num_node_state(N_MEMORY)` does
not fully utilize the CPU, so we need to multiply it by a multiplier.

Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements,
the gains are marginal.

 multiplier      1       2       3       4       5
 ------------    ------- ------- ------- ------- -------
 256G 2node      358ms   215ms   157ms   134ms   126ms
 2T   4node      979ms   679ms   543ms   489ms   481ms
 50G  2node      71ms    44ms    37ms    30ms    31ms

Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient
resource management.
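
To make the diminishing returns concrete, the step-to-step saving can be
computed from the table. The following is a minimal sketch using the
256G 2node row; `ms_256g_2node` and `step_gain_percent` are hypothetical names
introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Init times in ms for multipliers 1..5, copied from the 256G 2node row. */
static const unsigned int ms_256g_2node[5] = { 358, 215, 157, 134, 126 };

/*
 * Percentage of time saved by raising the multiplier by one,
 * from (step + 1) to (step + 2). 'n' is the number of entries in 'ms'.
 */
static double step_gain_percent(const unsigned int *ms, size_t n, size_t step)
{
	assert(ms != NULL && step + 1 < n && ms[step] != 0);
	return ((double)ms[step] - (double)ms[step + 1]) * 100.0 /
	       (double)ms[step];
}
```

For this row, the saving per step is roughly 40% (1 to 2), 27% (2 to 3),
15% (3 to 4) and 6% (4 to 5), so each additional step buys less than the
previous one.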

Test result
-----------
 test case            no patch(ms)    patched(ms)    saved
 -------------------  --------------  -------------  --------
 256c2T(4 node) 1G    4745            2024           57.34%
 128c1T(2 node) 1G    3358            1712           49.02%
 12T 1G               77000           18300          76.23%

 256c2T(4 node) 2M    3336            1051           68.52%
 128c1T(2 node) 2M    1943            716            63.15%

Change log
----------
Changes in v6:
- Fix a Kconfig warning
- Fix a potential bug in gather_bootmem_prealloc_node

Changes in v5:
- https://lore.kernel.org/lkml/20240126152411.1238072-1-gang.li@xxxxxxxxx/
- Use prep_and_add_allocated_folios in 2M hugetlb parallelization
- Update huge_boot_pages in arch/powerpc/mm/hugetlbpage.c
- Revise struct padata_mt_job comment
- Add 'max_threads' section in cover letter
- Collect more Reviewed-by

Changes in v4:
- https://lore.kernel.org/r/20240118123911.88833-1-gang.li@xxxxxxxxx
- Make padata_do_multithreaded dispatch all jobs with a global iterator
- Revise commit message
- Rename some functions
- Collect Tested-by and Reviewed-by

Changes in v3:
- https://lore.kernel.org/all/20240102131249.76622-1-gang.li@xxxxxxxxx/
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag

Changes in v2:
- https://lore.kernel.org/all/20231208025240.4744-1-gang.li@xxxxxxxxx/
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb

v1:
- https://lore.kernel.org/all/20231123133036.68540-1-gang.li@xxxxxxxxx/
- parallelize 2M hugetlb initialization with workqueue

Gang Li (8):
  hugetlb: code clean for hugetlb_hstate_alloc_pages
  hugetlb: split hugetlb_hstate_alloc_pages
  hugetlb: pass *next_nid_to_alloc directly to
    for_each_node_mask_to_alloc
  padata: dispatch works on different nodes
  padata: downgrade padata_do_multithreaded to serial execution for
    non-SMP
  hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  hugetlb: parallelize 2M hugetlb allocation and initialization
  hugetlb: parallelize 1G hugetlb initialization

 arch/powerpc/mm/hugetlbpage.c |   2 +-
 fs/Kconfig                    |   1 +
 include/linux/hugetlb.h       |   2 +-
 include/linux/padata.h        |  14 +-
 kernel/padata.c               |  14 +-
 mm/hugetlb.c                  | 241 +++++++++++++++++++++++-----------
 mm/mm_init.c                  |   1 +
 7 files changed, 190 insertions(+), 85 deletions(-)

--
2.20.1