Re: [PATCH v3 3/7] mm: Add write-protect and clean utilities for address space ranges

From: Linus Torvalds
Date: Thu Oct 03 2019 - 12:55:43 EST


On Thu, Oct 3, 2019 at 12:56 AM Thomas Hellstrom <thellstrom@xxxxxxxxxx> wrote:
>
> Both the cleaning operation and the wp operation operate on shared
> writable mappings, and since they are also both restricted to entries
> that may take part in dirty-tracking (they're ignoring pmds and puds),
> perhaps a "dirty" may make sense anyway, and to point out the similarity:
>
> "clean_mapping_dirty_range() and wp_mapping_dirty_range()"?

Hmm. Yeah, either that, or "shared", as in "clean_shared_mapping_range()"?

I don't know. Maybe this isn't a huge deal, and the "don't care about
private dirty/write" is kind of implied by the fact that it's about
the backing mapping rather than the vma.

But on the other hand we do have things like "unmap_mapping_range()"
which will unmap all the private mappings too, so maybe it's a good
idea to clarify that this is about shared mappings.

I suspect it's a bit of a bike shed, and I could go either way.

> > And I'd rather see
> > separate 'struct mm_walk_ops' than shared ones that then have a flag
> > in a different structure to change behavior.
> >
> > Yes, yes, some of the levels share the same logic, but that just means
> > that you can use the same function pointer for that level in the
> > different "clean" vs "wp" mm_walk_op.
>
> I think that this comment and this last one:
>
> > Also, why do you loop inside the pmd_entry function, instead of just
> > having a pte_entry function?"
>
> are tied together. The pagewalk code is kind of awkward since if you
> provide a pte_entry function, then huge pmds are unconditionally split,

So I think that we could handle that differently.

But even if you don't want to do the pte_entry function, the "use two
different sets of walker op structures" point stands. Instead of having
a test for "am I walking for wp or not", just make it implicit.
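
To make that concrete: two op tables sharing the pmd handler, and the
walk_page_range() call site just picks one of them, so the "wp or clean"
question never shows up in the callbacks. A sketch, with handler names
invented purely for illustration:

        /* Sketch only -- the handler names are made up: */
        static const struct mm_walk_ops clean_walk_ops = {
                .pmd_entry = clean_wp_pmd_entry,  /* shared huge-pmd handler */
                .pte_entry = clean_pte_entry,
        };

        static const struct mm_walk_ops wp_walk_ops = {
                .pmd_entry = clean_wp_pmd_entry,  /* same function pointer */
                .pte_entry = wp_pte_entry,
        };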

Also, I do think that you're seeing what is a weakness in the pte
walkers wrt splitting. I do think we shouldn't split unconditionally.

I suspect we should instead change the logic to split only if we
didn't have a pmd_entry. Because looking at the existing cases, if you
have a pmd_entry, you currently never have a pte_entry, so it wouldn't
change semantics for current cases.
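
At the pmd level that's roughly this (just a sketch of the idea; the real,
untested attempt is the patch attached below):

        /* Only break down huge pmds when nobody handles them at pmd level */
        if (!ops->pmd_entry) {
                split_huge_pmd(walk->vma, pmd, addr);
                if (pmd_trans_unstable(pmd))
                        goto again;
        }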

> even if handled in pmd_entry,
> forcing pmd-aware users to also handle ptes in pmd_entry(). I'm not
> entirely sure why, but it looks like it's to avoid a race where huge
> pmds could otherwise be unintentionally split if appearing *after* the
> pmd_entry() call.

See above: we never have both pmd_entry and pte_entry walkers as it is
right now. So you can't have that race.

> Other pagewalk users do the same here, see for
> example clear_refs_pte_range();
>
> https://elixir.bootlin.com/linux/latest/source/fs/proc/task_mmu.c#L1040
>
> Also the pagewalk walk_pte_range() for some reason doesn't take the page
> table lock, which means that a pte may well be cleared under us while we
> have it cached for modification.

Ho humm. That looks really sketchy to me. We actually have very few
pte_entry cases, perhaps exactly because it's all kinds of broken.

There's a couple of odd architectures, and then there is
prot_none_walk_ops. And that one purely looks at the pte's, and does a
special "can we do this" phase before doing the real modification.
It's limited to special PFNMAP/MIXEDMAP things.
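
For reference, that one currently looks like this (from memory, so
double-check mm/mprotect.c for the exact names):

        static const struct mm_walk_ops prot_none_walk_ops = {
                .pte_entry      = prot_none_pte_entry,
                .hugetlb_entry  = prot_none_hugetlb_entry,
                .test_walk      = prot_none_test,
        };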

So it does look like the pte walking code is accidentally (or
intentionally) ok in that case, but looking at the architecture users
(s390, at least), I do get the feeling that the lack of locking is an
actual and real bug.

It's probably impossible to hit in practice because it's limited to
special ops and you'd have to race with swapout and even then it
might not much matter. But it looks _wrong_.

And honestly, it looks like the clear_refs_pte_range() example you
point at does what it does for _exactly_ the same reason your code does
this: the pte walker is buggy, and the pmd case doesn't want to split.

The pte walker code in question has been there since 2008. This is not
a new thing. The clear_refs_pte_range() code partly goes all the way
back to that, although it's been changed some since. But the actual
pte_offset_map_lock() call in there predates the generic walker, so
what I think happened is that the generic walker was done only
partially, and intentionally didn't go all the way to the pte level
exactly because the walker was made to split things up unconditionally
if you did the pte case.

> For these reasons we can't use the pte_entry, even internally, and this
> means we have three choices:
>
> a) Selecting the pte function using a bool. Saves code and avoids an
> extra indirect function call.
> b) Duplicating the pmd_entry with a different pte function, also
> duplicating the ops. Avoids the extra indirect function call but adds
> some extra code.
> c) Instead of the bool, use another function pointer in yet another ops
> pointed to by the private structure. Saves some code.

So considering that

- there is one current user of the pte code that doesn't care about
the proper pte lock, because what it does is a racy pre-check anyway

- that one user is also documented to not care too much about
performance (it's a crazy special case)

- the other users _do_ look like they are currently buggy and want a
stable pte with proper locking

- there are currently no cases of "I have a pmd walker _and_ a pte
walker" because that case was always broken due to splitting

I would say that the current walker code is simply buggy in this
respect, and people have worked around it for over a decade.

End result: since you have an actual test-case that wants this, I'd
like to at least try

d) Fix the pte walker to do the right thing, then just use separate
pte walkers in your code

The fix would be those two conceptual changes:

1) don't split if the walker asks for a pmd_entry (the walker itself
can then decide to split, of course, but right now no walkers want it
since there are no pmd _and_ pte walkers, because people who want that
do the pte walk themselves)

2) get the proper page table lock if you do walk the pte, since
otherwise it's racy

Then there won't be any code duplication, because all the duplication
you now have at the pmd level is literally just workarounds for the
fact that our current walker has this bug.

That "fix the pte walker" would be one preliminary patch that would
look something like the attached TOTALLY UNTESTED garbage.

I call it "garbage" because I really hope people take it just as what
it is: "something like this". It compiles for me, and I did try to
think it through, but I might have missed some big piece of the
picture when writing that patch.

And yes, this is a much bigger conceptual change for the VM layer, but
I really think our pagewalk code is actively buggy right now, and is
forcing users to do bad things because they work around the existing
limitations.

Hmm? Could some of the core mm people look over that patch?

And yes, I was tempted to move the proper pmd locking into the walker
too, and do

        ptl = pmd_trans_huge_lock(pmd, vma);
        if (ptl) {
                err = ops->pmd_entry(pmd, addr, next, walk);
                spin_unlock(ptl);
        ...

but while I think that's the correct thing to do in the long run, that
would have to be done together with changing all the existing
pmd_entry users. It would make the pmd_entry _solely_ handle the
hugepage case, and then you'd have to remove the locking in the
pmd_entry, and have to make the pte walking be a walker entry. But
that would _really_ clean things up, and would make things like
smaps_pte_range() much easier to read, and much more obvious (it would
be split into a smaps_pmd_range and smaps_pte_range, and the callbacks
wouldn't need to know about the complex locking).
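
As a strawman (just a sketch reusing the existing smaps_pmd_entry() and
smaps_pte_entry() helpers; I haven't tried any of it), the smaps side
could end up looking something like:

        /* Walker already holds pmd_trans_huge_lock(), only sees huge pmds */
        static int smaps_pmd_range(pmd_t *pmd, unsigned long addr,
                                   unsigned long end, struct mm_walk *walk)
        {
                smaps_pmd_entry(pmd, addr, walk);
                return 0;
        }

        /* Walker already holds the pte lock from pte_offset_map_lock() */
        static int smaps_pte_range(pte_t *pte, unsigned long addr,
                                   unsigned long end, struct mm_walk *walk)
        {
                smaps_pte_entry(pte, addr, walk);
                return 0;
        }

        static const struct mm_walk_ops smaps_walk_ops = {
                .pmd_entry      = smaps_pmd_range,
                .pte_entry      = smaps_pte_range,
                .hugetlb_entry  = smaps_hugetlb_range,
        };

and the callbacks stay oblivious to both the splitting and the locking.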

So I think this is the right direction to move into, but I do want
people to think about this, and think about that next phase of doing
the pmd_trans_huge_lock too.

Comments?

Linus
mm/pagewalk.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..6ae95932e799 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -7,11 +7,13 @@
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
                           struct mm_walk *walk)
 {
-        pte_t *pte;
+        pte_t *pte, *start_pte;
         int err = 0;
         const struct mm_walk_ops *ops = walk->ops;
+        spinlock_t *ptl;

-        pte = pte_offset_map(pmd, addr);
+        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+        start_pte = pte;
         for (;;) {
                 err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                 if (err)
@@ -22,7 +24,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
                 pte++;
         }

-        pte_unmap(pte);
+        pte_unmap_unlock(start_pte, ptl);
         return err;
 }

@@ -49,21 +51,24 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
                  * This implies that each ->pmd_entry() handler
                  * needs to know about pmd_trans_huge() pmds
                  */
-                if (ops->pmd_entry)
+                if (ops->pmd_entry) {
                         err = ops->pmd_entry(pmd, addr, next, walk);
-                if (err)
-                        break;
-
-                /*
-                 * Check this here so we only break down trans_huge
-                 * pages when we _need_ to
-                 */
-                if (!ops->pte_entry)
-                        continue;
+                        if (err)
+                                break;
+                        /* No pte level walking? */
+                        if (!ops->pte_entry)
+                                continue;
+                        /* No pte level at all? */
+                        if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+                                continue;
+                } else {
+                        if (!ops->pte_entry)
+                                continue;

-                split_huge_pmd(walk->vma, pmd, addr);
-                if (pmd_trans_unstable(pmd))
-                        goto again;
+                        split_huge_pmd(walk->vma, pmd, addr);
+                        if (pmd_trans_unstable(pmd))
+                                goto again;
+                }
                 err = walk_pte_range(pmd, addr, next, walk);
                 if (err)
                         break;