Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

From: Andrea Arcangeli
Date: Thu Jan 07 2021 - 18:49:46 EST


On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote:
> Ho humm. I had obviously not looked very much at that code. I had done
> a quick git grep, but now that I look closer, it *does* get the
> mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and
> then it does a mmap_write_downgrade().
>
> So that's why I had looked more at the UFFD code, because that one was
> the one I was aware of doing this all with just the read lock. I
> _thought_ the softdirty code already took the write lock and wouldn't
> race with page faults.
>
> But I had missed that write_downgrade. So yeah, this code has the same issue.

I overlooked the same thing initially. It was only when I noticed that
it also used mmap_read_lock that I figured the ad-hoc group lock
solution for uffd-wp wasn't worth it, even though it was fully
self-contained thanks to the handle_userfault() catcher for the uffd-wp
bit in the pagetable, since uffd-wp could always use whatever
clear_refs used to solve it.

> Anyway, the fix is - I think - the same I outlined earlier when I was
> talking about UFFD: take the thing for writing, so that you can't
> race.

Sure.

> The alternate fix remains "make sure we always flush the TLB before
> releasing the page table lock, and make COW do the copy under the page
> table lock". But I really would prefer to just have this code follow

The copy under PT lock isn't enough.

Flushing the TLB before releasing is enough, of course.
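
As a rough illustration of why the flush ordering matters, here is a
toy model with one PTE and one CPU's cached translation. Every name in
it (model_pte, model_tlb, model_wrprotect() and so on) is hypothetical
and nothing here is kernel API. It only assumes that a CPU with a valid
cached writable translation does not look at the PTE again.

```c
#include <stdbool.h>

/* Toy model only: one PTE and one CPU's cached translation of it. */
struct model_pte { bool writable; };
struct model_tlb { bool valid; bool writable; };

/* Write-protect the PTE, optionally flushing before the "unlock". */
void model_wrprotect(struct model_pte *pte, struct model_tlb *tlb,
                     bool flush_before_unlock)
{
    /* The PT lock would be taken here. */
    pte->writable = false;
    if (flush_before_unlock)
        tlb->valid = false;
    /* The PT lock would be released here. */
}

/* A write uses the cached translation; only a miss reads the PTE. */
bool model_write_without_fault(const struct model_pte *pte,
                               struct model_tlb *tlb)
{
    if (!tlb->valid) {
        tlb->valid = true;
        tlb->writable = pte->writable;
    }
    return tlb->writable;
}

/* True if a write can still land after the PTE was write-protected. */
bool stale_write_possible(bool flush_before_unlock)
{
    struct model_pte pte = { .writable = true };
    struct model_tlb tlb = { .valid = true, .writable = true };

    model_wrprotect(&pte, &tlb, flush_before_unlock);
    return model_write_without_fault(&pte, &tlb);
}
```

In this model stale_write_possible(false) is true and
stale_write_possible(true) is false. It deliberately leaves out the
COW copy and the page_count check.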

Note also the 2/2 patch that I sent:

https://lkml.kernel.org/r/20210107200402.31095-3-aarcange@xxxxxxxxxx

2/2 would have been my preferred solution for both, and it works
fine. All the corruption that was trivially reproducible with the heavy
selftest program in the kernel is gone.

If the TLB flush pending issue were the only regression that page_count
in do_wp_page introduced, I would never have suggested that we
re-evaluate it. It would be a good tradeoff in that case, even if it
penalized the soft-dirty runtime, especially if we were allowed to
deploy 2/2 as a non-blocking solution.

Until yesterday I fully intended to just propose 1/2 and 2/2, with a
whole different cover letter, CC stable and close this case.

> all the usual rules, and if it does a write protect, then it should
> take the mmap_sem for writing.

The problem isn't about performance anymore; the problem is a silent
ABI break for long-term PIN users attached to an mm under clear_refs.

> Why is that very simple rule so bad?
>
> (And see my unrelated but incidental note on it being a good idea to
> try to minimize latency by making sure we don't do any IO under the
> mmap lock - whether held for reading _or_ writing. Because I do think
> we can improve in that area, if you have some good test-case).

That would be great indeed.

Thanks,
Andrea