Re: mm: delay rmap removal until after TLB flush

From: David Hildenbrand
Date: Thu Nov 03 2022 - 05:53:48 EST


On 31.10.22 19:43, Linus Torvalds wrote:
Updated subject line, and here's the link to the original discussion
for new people:

https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@xxxxxxxxx/

On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

Ok. At that point we no longer have the pte or the virtual address, so
it's not going to be exactly the same debug output.

But I think it ends up being fairly natural to do

VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);

instead, and I've fixed that last patch up to do that.

Ok, so I've got a fixed set of patches based on the feedback from
PeterZ, and also tried to do the s390 updates for this blindly, and
pushed them out into a git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

If people really want to see the patches in email again, I can do
that, but most of you already have, and the changes are either trivial
fixes or the s390 updates.

For the s390 people that I've now added to the participant list maybe
the git tree is fine - and the fundamental explanation of the problem
is in that top-most commit (with the three preceding commits being
prep-work). Or that link to the thread about this all.

That top-most commit is also where I tried to fix things up for s390
that uses its own non-gathering TLB flush due to
CONFIG_MMU_GATHER_NO_GATHER.

NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
rebased etc for further comments and fixes. So don't consider that
stable, it's still more of an RFC branch.

At a minimum I'll update it with Ack's etc, assuming I get those, and
my s390 changes are entirely untested and probably won't work.

As far as I can tell, s390 doesn't actually *have* the problem that
causes this change, because of its synchronous TLB flush, but it
obviously needs to deal with the change of rmap zapping logic.

Also added a few people who are explicitly listed as being mmu_gather
maintainers. Maybe people saw the discussion on the linux-mm list, but
let's make it explicit.

Do people have any objections to this approach, or other suggestions?

I do *not* consider this critical, so it's a "queue for 6.2" issue for me.

It probably makes most sense to queue in the -MM tree (after the thing
is acked and people agree), but I can keep that branch alive too and
just deal with it all myself as well.

Anybody?

Happy to see that we're still decrementing the mapcount before decrementingthe refcount, I was briefly concerned.

I was not able to come up quickly with something that would be fundamentally wrong here, but devil is in the detail.

Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON instead of VM_BUG_ON).

I agree that 6.2 is good enough and that upstreaming this via the -MM tree would be a good way to move forward.

--
Thanks,

David / dhildenb