Re: [PATCH] mm/khugepaged: fix filemap page_to_pgoff(page) != offset

From: Matthew Wilcox
Date: Sat Oct 10 2020 - 18:52:17 EST


On Fri, Oct 09, 2020 at 08:07:59PM -0700, Hugh Dickins wrote:
> There have been elusive reports of filemap_fault() hitting its
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
> with CONFIG_READ_ONLY_THP_FOR_FS=y.
>
> Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
> CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
> without NUMA reuses the same huge page after collapse_file() failed
> (whereas NUMA targets its allocation to the respective node each time).
> And most of us were usually testing with CONFIG_NUMA=y kernels.

Good catch. There have been at least three bugs in recent times which
can cause this VM_BUG_ON_PAGE() to trigger. This one, one where swapping
out a THP led to all 512 entries pointing to the same non-huge page on
swapin (fixed in -mm) and one that I introduced for a few weeks in -mm
where failing to split a THP would lead to random tree corruption due
to a non-zeroed node being freed to the slab cache.

There may yet be a fourth. I've seen it occasionally in recent testing
so I'll add this patch and see if it disappears.

> Instead, non-NUMA khugepaged_prealloc_page() release the old page
> if anyone else has a reference to it (1% of cases when I tested).

I think this is a good way to fix the problem. We could also change
khugepaged to insert a frozen page, ensuring that find_get_entry()
would spin until the collapse has succeeded or the page was removed
from the cache again. But I have no problem with this approach.

I want to note that this is a silent data corruption for reads.
generic_file_buffered_read() has a reference to the page, so this
patch will fix it, but before it could be copying the wrong data
to userspace.

Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>