Re: [PATCH 08/10] mm/hugetlb: Make walk_hugetlb_range() safe to pmd unshare

From: John Hubbard
Date: Tue Dec 06 2022 - 17:31:57 EST


On 12/6/22 13:51, Peter Xu wrote:
On Tue, Dec 06, 2022 at 01:03:45PM -0800, John Hubbard wrote:
On 12/6/22 08:45, Peter Xu wrote:
I've got a fixup attached. John, since this got your attention please also
have a look too in case there's further issues.


Well, one question: Normally, the pattern of "release_lock(A); call f();
acquire_lock(A);" is tricky, because one must revalidate that the state
protected by A has not changed while the lock was released. However, in
this case, it's letting page fault handling proceed, which already
assumes that pages might be gone, so generally that seems OK.

Yes it's tricky, but not as tricky in this case.

I hope my documentation supplemented that (in the fixup patch):

+ * @hugetlb_entry: if set, called for each hugetlb entry. Note that
+ * currently the hook function is protected by hugetlb
+ * vma lock to make sure pte_t* and the spinlock is valid
+ * to access. If the hook function needs to yield the

So far so good...

+ * thread or retake the vma lock for some reason, it
+ * needs to properly release the vma lock manually,
+ * and retake it before the function returns.

...but you can actually delete this second sentence. It does not add
any real information--clearly, if you must drop the lock, then you must
"manually" drop the lock.

And it still ignores my original question, which I don't think I've
fully communicated. Basically, what can happen to the protected data
during the time when the lock is not held?


The vma lock here makes sure the pte_t and the pgtable spinlock being
stable. Without the lock, they're prone to be freed in parallel.


Yes, but think about this: if the vma lock protects against the pte
going away, then:

lock()
get a pte
unlock()

...let hmm_vma_fault() cond_resched() run...

lock()
...whoops, something else release the pte that I'd previously
retrieved.


However, I'm lagging behind on understanding what the vma lock actually
protects. It seems to be a hugetlb-specific protection for concurrent
freeing of the page tables?

Not exactly freeing, but unsharing. Mike probably has more to say. The
series is here:

https://lore.kernel.org/all/20220914221810.95771-1-mike.kravetz@xxxxxxxxxx/#t

If so, then running a page fault handler seems safe. If there's something
else it protects, then we might need to revalidate that after
re-acquiring the vma lock.

Nothing to validate here. The only reason to take the vma lock is to match
with the caller who assumes the lock taken, so either it'll be released
very soon or it prepares for the next hugetlb pgtable walk (huge_pte_offset).


ummm, see above. :)


Also, scattering hugetlb-specific locks throughout mm seems like an
unfortuate thing, I wonder if there is a longer term plan to Not Do
That?

So far HMM is really the only one - normally hugetlb_entry() hook is pretty
light, so not really throughout the whole mm yet. It's even not urgently
needed for the other two places calling cond_sched(), I added it mostly
just for completeness, and with the slight hope that maybe we can yield
earlier for some pmd unsharers.

But yes it's unfortunate, I just didn't come up with a good solution.
Suggestion is always welcomed.


I guess it's on me to think of something cleaner, so if I do I'll pipe
up. :)

thanks,
--
John Hubbard
NVIDIA