Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

From: Andrea Arcangeli
Date: Tue Dec 22 2020 - 21:58:32 EST


On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> We are talking about non-COW anon pages here -- they can't be mapped
> more than once. So why not just identify them by checking
> page_mapcount == 1 and then unconditionally reuse them? (This is
> probably where I've missed things.)

The problem in depending on page_mapcount to decide if it's COW or
non-COW (respectively wp_page_copy or wp_page_reuse) is that is GUP
may elevate the count of a COW anon page that become a non-COW anon
page.

This is Jann's idea not mine.

The problem is we have an unprivileged long term GUP like vmsplice
that facilitates elevating the page count indefinitely, until the
parent finally writes a secret to it. Theoretically a short term pin
would do it too so it's not just vmpslice, but the short term pin
would be incredibly more challenging to become a concern since it'd
kill a phone battery and flash before it can read any data.

So what happens with your page_mapcount == 1 check is that it doesn't
mean non-COW (we thought it did until it didn't for the long term gup
pin in vmsplice).

Jann's testcases does fork() and set page_mapcount 2 and page_count to
2, vmsplice, take unprivileged infinitely long GUP pin to set
page_count to 3, queue the page in the pipe with page_count elevated,
munmap to drop page_count to 2 and page_mapcount to 1.

page_mapcount is 1, so you'd think the page is non-COW and owned by
the parent, but the child can still read it so it's very much still
wp_page_copy material if the parent tries to modify it. Otherwise the
child can read the content.

This was supposed to be solvable by just doing the COW in gup(write=0)
case if page_mapcount > 1 with commit 17839856fd58. I'm not exactly
sure why that didn't fly and it had to be reverted by Peter in
a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a but at the time this was
happening I was side tracked by urgent issues and I didn't manage to
look back of how we ended up with the big hammer page_count == 1 check
instead to decide if to call wp_page_reuse or wp_page_shared.

So anyway, the only thing that is clear to me is that keeping the
child from reading the page_mapcount == 1 pages of the parent, is the
only reason why wp_page_reuse(vmf) will only be called on
page_count(page) == 1 and not on page_mapcount(page) == 1.

It's also the reason why your page_mapcount assumption will risk to
reintroduce the issue, and I only wish we could put back page_mapcount
== 1 back there.

Still even if we put back page_mapcount there, it is not ok to leave
the page fault with stale TLB entries and to rely on the fact
wp_page_shared won't run. It'd also avoid the problem but I think if
you leave stale TLB entries in change_protection just like NUMA
balancing does, it also requires a catcher just like NUMA balancing
has, or it'd truly work by luck.

So until we can put a page_mapcount == 1 check back there, the
page_count will be by definition unreliable because of the speculative
lookups randomly elevating all non zero page_counts at any time in the
background on all pages, so you will never be able to tell if a page
is true COW or if it's just a spurious COW because of a speculative
lookup. It is impossible to differentiate a speculative lookup from a
vmsplice ref in a child.

Thanks,
Andrea