Re: [PATCH 1/1] mm/khugepaged: reduce process visible downtime by pre-zeroing hugepage

From: David Hildenbrand
Date: Tue Mar 12 2024 - 09:19:52 EST


On 12.03.24 14:09, Lance Yang wrote:
Hey David,

Thanks for taking time to review!

On Tue, Mar 12, 2024 at 12:19 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 08.03.24 08:49, Lance Yang wrote:
The patch reduces the process-visible downtime during hugepage
collapse. This is achieved by pre-zeroing the hugepage before
acquiring mmap_lock (write mode) if nr_pte_none >= 256, without
affecting the efficiency of khugepaged.

On an Intel Core i5 CPU, the process-visible downtime during
hugepage collapse is as follows:

| nr_ptes_none | w/o __GFP_ZERO | w/ __GFP_ZERO | Change  |
|--------------|----------------|---------------|---------|
| 511          | 233us          | 95us          | -59.21% |
| 384          | 376us          | 219us         | -41.20% |
| 256          | 421us          | 323us         | -23.28% |
| 128          | 523us          | 507us         | -3.06%  |

Of course, alloc_charge_hpage() will take longer to run with
the __GFP_ZERO flag.

| Func | w/o __GFP_ZERO | w/ __GFP_ZERO |
|----------------------|----------------|---------------|
| alloc_charge_hpage | 198us | 295us |

But it's not a big deal because it doesn't impact the total
time spent by khugepaged in collapsing a hugepage. In fact,
it would decrease.

It does look sane to me and not overly complicated.

But, it's an optimization really only when we have quite a bunch of
pte_none(), possibly repeatedly so that it really makes a difference.

Usually, when we repeatedly collapse that many pte_none() we're just
wasting a lot of memory and should re-evaluate life choices :)

Agreed! It seems that the default value of max_ptes_none may be set too
high, which could result in the memory waste we're discussing.

IIRC, some companies disable it completely (set to 0) because of that.



So my question is: do we really care about it that much that we care to
optimize?

IMO, although it may not be our main concern, reducing the impact of
khugepaged on the process remains crucial. I think that users also prefer
minimal interference from khugepaged.

The problem I am having with this is that for the *common* case where
we have a small number of pte_none(), we cannot really optimize because
we have to perform the copy.

So this feels like we're rather optimizing a corner case, and I am not
so sure if that is really worth it.

Other thoughts?

--
Cheers,

David / dhildenb