Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization

From: Gang Li
Date: Mon Jan 22 2024 - 21:40:45 EST


On 2024/1/22 19:30, Muchun Song wrote:
> On Jan 22, 2024, at 18:12, Gang Li <gang.li@xxxxxxxxx> wrote:
>>
>> On 2024/1/22 15:10, Muchun Song wrote:
>>> On 2024/1/18 20:39, Gang Li wrote:
>>>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
>>>>  {
>>>> -	unsigned long i;
>>>> +	struct hstate *h = (struct hstate *)arg;
>>>> +	int i, num = end - start;
>>>> +	nodemask_t node_alloc_noretry;
>>>> +	unsigned long flags;
>>>> +	int next_node = 0;
>>>
>>> This should be first_online_node, which may not be zero.
>>
>> That's right. Thanks!
>>
>>>> -	for (i = 0; i < h->max_huge_pages; ++i) {
>>>> -		if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>>>> +	/* Bit mask controlling how hard we retry per-node allocations. */
>>>> +	nodes_clear(node_alloc_noretry);
>>>> +
>>>> +	for (i = 0; i < num; ++i) {
>>>> +		struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
>>>> +						&node_alloc_noretry, &next_node);
>>>> +		if (!folio)
>>>> 			break;
>>>> +		spin_lock_irqsave(&hugetlb_lock, flags);
>>>
>>> I suspect there will be more contention on this lock when parallelizing.
>>
>> In the worst case, there are only 'numa node number' of threads in
>> contention. And in my testing, it doesn't degrade performance, but
>> rather improves performance due to the reduced granularity.
>
> So, the performance does not change if you move the lock out of the
> loop?


If we move the lock outside the loop, the multi-threaded allocation effectively serializes on it, which definitely reduces performance.

```
+ spin_lock_irqsave(&hugetlb_lock, flags);
for (i = 0; i < num; ++i) {
struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
&node_alloc_noretry, &next_node);
if (!folio)
break;
- spin_lock_irqsave(&hugetlb_lock, flags);
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
- spin_unlock_irqrestore(&hugetlb_lock, flags);
cond_resched();
}
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}
```