Re: [PATCH 1/4] mm/memory: convert do_page_mkwrite() to use folios

From: Matthew Wilcox
Date: Tue Jul 04 2023 - 15:36:08 EST


On Sun, Jul 02, 2023 at 10:58:47PM -0700, Sidhartha Kumar wrote:
> @@ -2947,14 +2947,14 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
>  		return ret;
>  	if (unlikely(!(ret & VM_FAULT_LOCKED))) {
> -		lock_page(page);
> -		if (!page->mapping) {
> -			unlock_page(page);
> +		folio_lock(folio);
> +		if (!folio_mapping(folio)) {
> +			folio_unlock(folio);

I promised to explain this better once I had time, and I have time now.

folio->mapping is used for a multitude of purposes, unfortunately.
Maybe some future work will reduce that, but for now, These Are The Rules.

If the folio is marked as being Slab, it's used for something else.
The folio does not belong to an address space (nor can it be mapped,
so we're not going to see it here, but sometimes we see it in other
contexts where we call folio_mapping()).

The bottom two bits are used as PAGE_MAPPING_FLAGS. If they're both
0, this folio belongs to a file, and the rest of folio->mapping is a
pointer to a struct address_space. Or they're both 0 because the whole
thing is NULL. More on that below. If the bottom two bits are 01b,
this is an anonymous folio, and folio->mapping is actually a pointer
to an anon_vma (which is not the same thing as an anon vma). If the
bottom two bits are 10b, this is a Movable page (anon & file memory is
also movable, but this is different). The folio->mapping points to a
struct movable_operations. If the bottom two bits are 11b, this is a
KSM allocation, and folio->mapping points to a struct ksm_stable_node.
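
To make the encoding concrete, here is a minimal userspace sketch of
decoding the bottom two bits. It is not kernel code: every model_* and
MODEL_* name is invented for illustration, and only the bit values
(01b anon, 10b Movable, 11b KSM) come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag values for the low two bits, as described in the text above. */
#define MODEL_MAPPING_ANON	0x1UL
#define MODEL_MAPPING_MOVABLE	0x2UL
#define MODEL_MAPPING_KSM	0x3UL
#define MODEL_MAPPING_FLAGS	0x3UL

/* Hypothetical classification of a mapping word. */
enum model_kind {
	MODEL_KIND_NONE,	/* the whole word is NULL */
	MODEL_KIND_FILE,	/* pointer to an address_space */
	MODEL_KIND_ANON,	/* pointer to an anon_vma */
	MODEL_KIND_MOVABLE,	/* pointer to movable_operations */
	MODEL_KIND_KSM,		/* pointer to a ksm_stable_node */
};

/* Pack a tag into the low bits of a pointer that is at least 4-byte aligned. */
static inline uintptr_t model_tag(const void *ptr, uintptr_t tag)
{
	uintptr_t word = (uintptr_t)ptr;

	assert((word & MODEL_MAPPING_FLAGS) == 0);
	assert(tag <= MODEL_MAPPING_FLAGS);
	return word | tag;
}

/* Decide what a mapping word refers to by looking at its low two bits. */
static inline enum model_kind model_classify(uintptr_t mapping)
{
	if (mapping == 0)
		return MODEL_KIND_NONE;

	switch (mapping & MODEL_MAPPING_FLAGS) {
	case MODEL_MAPPING_ANON:
		return MODEL_KIND_ANON;
	case MODEL_MAPPING_MOVABLE:
		return MODEL_KIND_MOVABLE;
	case MODEL_MAPPING_KSM:
		return MODEL_KIND_KSM;
	default:
		return MODEL_KIND_FILE;
	}
}

/* Recover the real pointer by masking the tag bits off. */
static inline void *model_untag(uintptr_t mapping)
{
	return (void *)(mapping & ~MODEL_MAPPING_FLAGS);
}
```

The point of the sketch is that a plain NULL test on the word and a
test on the tag bits answer two different questions, which matters
for the rest of this mail.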

When we remove a folio from the page cache, we reset folio->mapping
to NULL. We often remove folios from the page cache before their
refcount drops to zero (the common case is to look up the folio in the
page cache, which grabs a reference, remove the folio from the page
cache which decrements the refcount, then put the folio which might be
the last refcount). So it's entirely possible to see a folio in this
function with a NULL mapping; that means it's been removed from the
file through a truncate or similar, and we need to fail the mkwrite.
Userspace is about to get a segfault.
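
The lookup / remove / put sequence can be modelled in userspace like
this. It is a toy with a one-slot "page cache"; all the names are
made up, and the only behaviour taken from the text is that removal
clears ->mapping and drops the cache's reference while another holder
can still see the folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_cache;

/* Hypothetical stand-in for a folio: just a mapping and a refcount. */
struct model_cached_folio {
	struct model_cache *mapping;
	int refcount;
	bool freed;
};

/* Hypothetical one-entry "page cache". */
struct model_cache {
	struct model_cached_folio *slot;
};

/* Insert the folio; the cache holds the initial reference. */
static inline void model_cache_add(struct model_cache *cache,
				   struct model_cached_folio *folio)
{
	folio->mapping = cache;
	folio->refcount = 1;
	folio->freed = false;
	cache->slot = folio;
}

/* Look the folio up; a successful lookup grabs a reference. */
static inline struct model_cached_folio *
model_cache_lookup(struct model_cache *cache)
{
	if (cache->slot)
		cache->slot->refcount++;
	return cache->slot;
}

/* Drop one reference; the folio is freed when the count hits zero. */
static inline void model_folio_put(struct model_cached_folio *folio)
{
	assert(folio->refcount > 0);
	if (--folio->refcount == 0)
		folio->freed = true;
}

/* Remove from the cache: clear ->mapping, then drop the cache's reference. */
static inline void model_cache_remove(struct model_cache *cache,
				      struct model_cached_folio *folio)
{
	assert(cache->slot == folio);
	cache->slot = NULL;
	folio->mapping = NULL;
	model_folio_put(folio);
}
```

Between model_cache_remove() and the final model_folio_put(), the
folio is alive with a NULL mapping, which is the window the mkwrite
check is looking for.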

If you find all of that confusing, well, I agree, and I'm trying to
simplify it.

So, with all that background, what's going on here? Part of the
"modern" protocol for handling page faults is to lock the folio
in vm_ops->page_mkwrite. But we still support (... why?) drivers
that haven't been updated. They return 0 on success instead of
VM_FAULT_LOCKED. So we take the lock for them, then check that the
folio wasn't truncated, and bail out if it looks like it was.
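
As a rough userspace model of that fallback (not the kernel function
itself): the MODEL_* flag values, the boolean standing in for the
folio lock, and both sample handlers are assumptions made up for the
sketch. Returning 0 on the truncated path is meant to mirror the
"bail out" case, which the caller treats as "try the fault again".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up flag values; the kernel's VM_FAULT_* constants differ. */
#define MODEL_FAULT_LOCKED	0x1u	/* handler returned with the folio locked */
#define MODEL_FAULT_ERROR	0x2u	/* stand-in for VM_FAULT_ERROR | VM_FAULT_NOPAGE */

/* Hypothetical folio: a mapping pointer and a flag modelling the lock. */
struct model_folio {
	void *mapping;
	bool locked;
};

typedef unsigned int (*model_mkwrite_fn)(struct model_folio *folio);

static inline unsigned int model_do_page_mkwrite(struct model_folio *folio,
						 model_mkwrite_fn mkwrite)
{
	unsigned int ret = mkwrite(folio);

	if (ret & MODEL_FAULT_ERROR)
		return ret;

	if (!(ret & MODEL_FAULT_LOCKED)) {
		/* Legacy handler returned 0: take the lock on its behalf. */
		folio->locked = true;
		if (!folio->mapping) {
			/* Looks truncated: unlock and bail out. */
			folio->locked = false;
			return 0;
		}
		ret |= MODEL_FAULT_LOCKED;
	} else {
		/* Modern handler must really have locked the folio. */
		assert(folio->locked);
	}
	return ret;
}

/* Modern-style handler: locks the folio and says so. */
static inline unsigned int model_modern_mkwrite(struct model_folio *folio)
{
	folio->locked = true;
	return MODEL_FAULT_LOCKED;
}

/* Legacy-style handler: returns 0 on success, folio left unlocked. */
static inline unsigned int model_legacy_mkwrite(struct model_folio *folio)
{
	(void)folio;
	return 0;
}
```

Either handler leaves the caller holding a locked folio on success;
the only difference is who took the lock.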

If we have a really old-school driver that has allocated a page,
mapped it to userspace, and set page->mapping to be, eg, Movable, then
by calling folio_mapping() instead of reading folio->mapping directly,
we'll see NULL where the raw field is non-NULL, mistakenly believe the
folio to have been truncated, and enter an endless loop.
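
The difference between the two checks can be shown with a small
userspace sketch. It is a simplified model: the struct, the helper
names and the "slab" flag are invented here, and the filtered accessor
only imitates the two behaviours described in this mail (NULL for Slab,
NULL when any of the bottom two bits is set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAPPING_MOVABLE	0x2UL
#define MODEL_MAPPING_FLAGS	0x3UL

/* Hypothetical folio: the raw mapping word plus a Slab marker. */
struct model_tagged_folio {
	uintptr_t mapping;
	bool slab;
};

/* Tag an aligned pointer as Movable, the way the old-school driver would. */
static inline uintptr_t model_make_movable(const void *ops)
{
	uintptr_t word = (uintptr_t)ops;

	assert((word & MODEL_MAPPING_FLAGS) == 0);
	return word | MODEL_MAPPING_MOVABLE;
}

/* What "!folio->mapping" tests: is the raw word NULL? */
static inline bool model_raw_is_null(const struct model_tagged_folio *folio)
{
	return folio->mapping == 0;
}

/* Accessor in the style of folio_mapping(): only file mappings pass. */
static inline void *model_folio_mapping(const struct model_tagged_folio *folio)
{
	if (folio->slab)
		return NULL;
	if (folio->mapping & MODEL_MAPPING_FLAGS)
		return NULL;
	return (void *)folio->mapping;
}
```

For a folio whose mapping word came from model_make_movable(), the
raw test says "not NULL" while the filtered accessor returns NULL, so
a truncation check built on the accessor fires for a folio that was
never truncated.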

Am I being paranoid here? Maybe! Drivers should have been updated by
now. The "modern" way was introduced in 2007 (commit d0217ac04ca6), so
it'd be nice to turn this into a WARN_ON_ONCE so drivers fix their code.
There are only ~30 implementations of page_mkwrite in the kernel, so it
might not take too long to check everything's OK.