Re: [PATCH RFC v2 00/12] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

From: David Hildenbrand
Date: Fri Nov 25 2022 - 04:45:45 EST


On 23.11.22 16:09, Peter Xu wrote:
Hi, David,

Thanks for taking a look.

On Wed, Nov 23, 2022 at 10:40:40AM +0100, David Hildenbrand wrote:
On 18.11.22 02:10, Peter Xu wrote:
Based on latest mm-unstable (96aa38b69507).

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, so this series also depends on that one.
Hopefully this series can make it a more complete resolution for pmd
unsharing.

PS: so far no one strongly ACKed this, let me keep the RFC tag. But I
think I'm already more confident than many of the RFCs I posted.

PS2: there are a lot of changes compared to rfcv1, so I'm not adding a
changelog. The whole idea is still the same, though.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used almost everywhere, since the walk is needed
even before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held, in either read
or write mode.

For normal memory types that's sufficient, since any pgtable removal
requires the mmap write lock (e.g. munmap or mm destruction). However,
hugetlb has the pmd unshare feature, which means that not only can a
pgtable page be gone before we start a walk, but even the pgtable page
we're currently walking can be freed under us (once unshared; in this case
it can only be the huge PUD page which contains 512 huge pmd entries, with
the vma mapped VM_SHARED). This is possible because even though freeing
the pgtable page requires the mmap write lock, that doesn't help when
we're walking another mm's pgtable, so the walk is still at risk even when
we hold current->mm's mmap lock.

Mike's recent work on the vma lock can already resolve most of this. It
does so by forbidding pmd unsharing while the lock is held, so there is no
further risk of the pgtable page being freed. That means if we can take
the vma lock around all huge_pte_offset() callers, we'll be safe.

[1]


As of the latest mm-unstable there are already a bunch of callers where we
did that, but also quite a few where we didn't, for various reasons. E.g.,
it may not be applicable in contexts that are not allowed to sleep, like
FOLL_NOWAIT. Or take huge_pmd_share(), which is actually a tricky user of
huge_pte_offset(),

[2]

because even if we took the vma lock, we'd be walking another mm's vma!
Taking the vma lock for all of those vmas is probably not going to work.

I have no report showing that I can trigger such a race, but code-wise I
don't see anything that stops it from happening. This series tries to
resolve that problem.

Let me try understand the basic problem first:

hugetlb walks page tables semi-lockless: while we hold the mmap lock, we
don't grab the page table locks. That's very hugetlb specific handling and I
assume hugetlb uses different mechanisms to sync against MADV_DONTNEED,
concurrent page faults... but that's no news. hugetlb is weird in many ways
:)

So, IIUC, you want a mechanism to synchronize against PMD unsharing. Can't
we use some very basic locking for that?


Sorry for the delay, finally found time to look into this again. :)

Yes we can in most cases. Please refer to paragraph [1] above, where I
referred to Mike's recent work on the vma lock. That's the basic locking
we need so far to protect against pmd unsharing. I'll attach the link in
the next post too, which is here:

https://lore.kernel.org/r/20220914221810.95771-1-mike.kravetz@xxxxxxxxxx


Using RCU / disabling local irqs seems a bit excessive because we *are*
holding the mmap lock and only care about concurrent unsharing.

The series wanted to address the places where the vma lock is not easy to
take. It originated while I was reading Mike's other patch; I forget the
exact details, but I noticed there are code paths where we may not want to
take a sleepable lock, e.g. in the follow-page code.

As I stated, whenever we already take the (expensive) mmap lock, the last thing we should have to worry about is taking another sleepable lock IMHO. Please correct me if I'm wrong.


The other one is huge_pmd_share() where we may have the mmap lock for
current mm but we're fundamentally walking another mm. It'll be tricky to
take a sleepable lock in such condition too.

We're already grabbing the i_mmap_lock_read(mapping), and the VMAs should be stable in that interval tree IIUC. So I wonder if taking VMA locks would really be problematic here. Anything obvious I am missing?


I mentioned these cases in the other paragraph above [2]. Let me try to
expand that in my next post too.

That would be great. I have yet to dedicate more time to understand all that complexity.


It's debatable whether all the remaining places can only work with either
RCU or irqs disabled, but the idea is at least to speed up those paths
where we still can. Here, irqoff might be a bit heavy, but the RCU lock
should always be superior to the vma lock when possible. The trade-off is
that we may still see stale pgtable data (since unsharing can still happen
in parallel), which is completely avoided when we take the vma lock.

IRQ disabled is frowned upon by RT folks, that's why I'd like to understand if this is really required. Also, adding RCU to an already complex mechanism doesn't necessarily make it easier :)

Let me dedicate some time today to dig into some details.

--
Thanks,

David / dhildenb