Re: [PATCH mm-unstable v1] mm: add a total mapcount for large folios

From: David Hildenbrand
Date: Thu Aug 10 2023 - 13:48:30 EST


On 10.08.23 19:15, Peter Xu wrote:
> On Thu, Aug 10, 2023 at 11:48:27AM +0100, Ryan Roberts wrote:
>>> For PTE-mapped THP, it might be a bit more noise, although I doubt it is
>>> really significant (judging from my experience managing PageAnonExclusive
>>> using set_bit/test_bit/clear_bit when (un)mapping anon pages).

>>> As folio_add_file_rmap_range() indicates, for PTE-mapped THPs we should be
>>> batching where possible (and Ryan is working on some more rmap batching).

>> Yes, I've just posted [1] which batches the rmap removal. That would allow you
>> to convert the per-page atomic_dec() into a (usually) single per-large-folio
>> atomic_sub().

>> [1] https://lore.kernel.org/linux-mm/20230810103332.3062143-1-ryan.roberts@xxxxxxx/

> Right, that'll definitely make more sense, thanks for the link; I'd be very
> happy to read more later (I finally got some free time recently...). But
> then does that mean David's patch could be attached at the end of that
> series instead of being proposed separately and early?

Not in my opinion. Batching rmap makes sense even without this change, and this change makes sense even without batching.
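
To make the batching point concrete, here is a simplified sketch only -- not Ryan's actual patch -- of what the removal side could look like once nr contiguous PTEs of the same large folio are processed in one go; the helper name is made up and _total_mapcount is the field this proposal would add:

	/*
	 * Simplified sketch only (not Ryan's actual patch): remove the rmap
	 * of nr contiguous pages that all belong to the same large folio.
	 */
	static void __folio_remove_rmap_batch(struct folio *folio, int nr)
	{
		/*
		 * One subtraction on the total mapcount instead of nr
		 * individual atomic_dec() calls.
		 */
		atomic_sub(nr, &folio->_total_mapcount);

		/*
		 * Per-page bookkeeping (e.g., _nr_pages_mapped,
		 * PageAnonExclusive) still has to be handled; elided here.
		 */
	}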


> I was asking mostly because I first read it as a standalone patch, and
> honestly I don't know its effect. That concern is based not only on the
> added atomic ops themselves, but also on the field changes.

> For example, this patch moves Hugh's _nr_pages_mapped into the 2nd tail
> page. I think that means that for any rmap change of any small page of a
> huge page we'll need to start touching one more 64B cacheline on x86. I
> really have no idea what that means, especially for a large SMP system:
> see 292648ac5cf1 for why I had that impression. But I don't have enough
> experience or data to prove it a problem either; maybe it would be
> interesting to measure the time needed for some PTE-mapped loops? E.g.,
> something like faulting in a THP,

Okay, so your speculation right now is:

1) Touching the additional cacheline might be problematic.

2) The additional atomic operation might be problematic.

> then measure the time the split takes (triggered by, e.g., an mprotect() of
> a single 4K page at offset 1M?) before/after this patch.

I can certainly try getting some numbers on that. If you're aware of other micro-benchmarks that would likely notice slower pte-mapping of THPs, please let me know.
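
For reference, the kind of loop I would start from -- a rough, untested userspace sketch, assuming x86-64 with 2 MiB THPs and THP enabled for the mapping; the helper names are made up:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	#define THP_SIZE	(2UL << 20)	/* assumption: 2 MiB PMD size */
	#define SUBPAGE_SIZE	4096UL

	static long long now_ns(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec * 1000000000LL + ts.tv_nsec;
	}

	int main(void)
	{
		char *raw, *buf;
		long long t0, t1, t2;

		/* Over-allocate so we can carve out a 2 MiB-aligned range. */
		raw = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (raw == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		buf = (char *)(((unsigned long)raw + THP_SIZE - 1) &
			       ~(THP_SIZE - 1));
		madvise(buf, THP_SIZE, MADV_HUGEPAGE);

		t0 = now_ns();
		memset(buf, 1, THP_SIZE);	/* fault in, ideally as one PMD-mapped THP */
		t1 = now_ns();
		/* mprotect() one 4K page at offset 1M -> forces the PMD mapping to be split. */
		mprotect(buf + (1UL << 20), SUBPAGE_SIZE, PROT_READ);
		t2 = now_ns();

		printf("fault-in: %lld ns, split: %lld ns\n", t1 - t0, t2 - t1);

		munmap(raw, 2 * THP_SIZE);
		return 0;
	}

A single run is obviously too noisy; looping and averaging (and checking AnonHugePages in smaps to confirm a THP was actually allocated) would be needed for anything meaningful.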


> When looking at this, I actually found one thing that is slightly
> confusing, not directly relevant to your patch, but regarding the reuse of
> tail page 1 at offset 24 bytes. Currently it's Hugh's _nr_pages_mapped,
> and you're proposing to replace it with the total mapcount:

>         atomic_t _nr_pages_mapped;               /*    88     4 */

> Now my question is: isn't byte 24 of tail page 1 used for keeping a
> poisoned mapping? See prep_compound_tail(), where it has:

>         p->mapping = TAIL_MAPPING;

> While here mapping is, AFAICT, also using offset 24 of tail page 1:

>         struct address_space * mapping;          /*    24     8 */

> I hope I did the math wrong somewhere, though.


I think your math is correct.

prep_compound_head() is called after prep_compound_tail(), so prep_compound_head() wins.
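
To spell out the ordering (condensed from mm/page_alloc.c from memory, so details may differ slightly): every tail page gets its ->mapping poisoned first, and the head preparation afterwards initializes the folio fields that overlay the 8 bytes at offset 24 of tail page 1, overwriting the TAIL_MAPPING poison:

	void prep_compound_page(struct page *page, unsigned int order)
	{
		int i;
		int nr_pages = 1 << order;

		__SetPageHead(page);
		for (i = 1; i < nr_pages; i++)
			prep_compound_tail(page, i);	/* p->mapping = TAIL_MAPPING */

		prep_compound_head(page, order);	/* runs last and "wins" */
	}

	static void prep_compound_head(struct page *page, unsigned int order)
	{
		struct folio *folio = (struct folio *)page;

		folio_set_order(folio, order);
		atomic_set(&folio->_entire_mapcount, -1);
		/* These overwrite the 8 bytes at tail page 1, offset 24 (TAIL_MAPPING): */
		atomic_set(&folio->_nr_pages_mapped, 0);
		atomic_set(&folio->_pincount, 0);
	}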

In __split_huge_page_tail() there is a VM_BUG_ON_PAGE() that explains the situation:

	/* ->mapping in first and second tail page is replaced by other uses */
	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
			page_tail);

Thanks for raising that, I had to look into that myself.

--
Cheers,

David / dhildenb