Re: Kernel NULL pointer deref and data corruptions with xfs on 6.1

From: Matthew Wilcox
Date: Mon Jul 24 2023 - 23:42:07 EST


On Tue, Jul 25, 2023 at 07:45:25AM +1000, Dave Chinner wrote:
> On Mon, Jul 24, 2023 at 12:23:31PM +0100, Daniel Dao wrote:
> > Hi again,
> >
> > We had another example of xarray corruption involving xfs and zsmalloc. We are
> > running zram as swap. We have 2 tasks deadlock waiting for page to be released
>
> Do your problems on 6.1 go away if you stop using zram as swap?

I think zram is the victim here, not the culprit. I think what's
going on is that -- somehow -- there are stale pointers in the xarray.
zram allocates these pages (I suspect most of the memory in this machine
is allocated to zram or page cache) and then we blow up when finding
a folio in the page cache which has a ->mapping that is actually a
movable_ops structure.

But how do we get stale pointers in the xarray? I've been worrying at
that problem for months. At some point, the refcount must go down to
zero:

static inline void folio_put(struct folio *folio)
{
	if (folio_put_testzero(folio))
		__folio_put(folio);
}

(assume we're talking about a large folio; everything seems to point
that way):

__folio_put_large:
	if (!folio_test_hugetlb(folio))
		__page_cache_release(folio);
	destroy_large_folio(folio);

destroy_large_folio:
	free_transhuge_page()
free_transhuge_page:
	free_compound_page(page);
free_compound_page:
	free_the_page(page, compound_order(page));
free_the_page:
	__free_pages_ok(page, order, FPI_NONE);
__free_pages_ok:
	if (!free_pages_prepare(page, order, fpi_flags))
free_pages_prepare:
	if (PageMappingFlags(page))
		page->mapping = NULL;
(doesn't trigger; PageMappingFlags are false for page cache)
	if (is_check_pages_enabled()) {
		if (free_page_is_bad(page))
free_page_is_bad:
	if (likely(page_expected_state(page, PAGE_FLAGS_CHECK_AT_FREE)))
		return false;

	/* Something has gone sideways, find it */
	free_page_is_bad_report(page);
page_expected_state:
	if (unlikely((unsigned long)page->mapping | ...
		return false;

free_page_is_bad_report:
	bad_page(page,
		 page_bad_reason(page, PAGE_FLAGS_CHECK_AT_FREE));
page_bad_reason:
	if (unlikely(page->mapping != NULL))
		bad_reason = "non-NULL mapping";

So (assuming that Daniel has check_pages_enabled set and isn't ignoring
important parts of dmesg, which seem like reasonable assumptions), the
last put of a folio must be after the folio has had its ->mapping cleared.
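
To spell that argument out, here is a toy userspace model of the
free-time check. It is only a sketch, not kernel code: struct toy_page
and the toy_* functions are made up for illustration and keep only the
->mapping test from the chain above. The check_pages_enabled parameter
stands in for is_check_pages_enabled().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the field under discussion. */
struct toy_page {
	void *mapping;
};

/* Models the ->mapping part of page_bad_reason(); NULL means "looks fine". */
static const char *toy_bad_reason(const struct toy_page *page)
{
	assert(page != NULL);
	if (page->mapping != NULL)
		return "non-NULL mapping";
	return NULL;
}

/*
 * Models free_page_is_bad(): with checking enabled, a page that still has
 * a ->mapping at free time is reported; with checking disabled, nothing is.
 */
static bool toy_free_page_is_bad(const struct toy_page *page,
				 bool check_pages_enabled)
{
	if (!check_pages_enabled)
		return false;
	return toy_bad_reason(page) != NULL;
}
```

In this model, a page freed with ->mapping still set is reported only
when checking is enabled, so a silent dmesg plus enabled checks implies
->mapping was already NULL at the final put.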

But we remove the folio from the page cache in page_cache_delete(),
right before we set the mapping to NULL. And again in
delete_from_page_cache_batch() (in the other order; I don't think that's
relevant?)

So where do we set folio->mapping to NULL without removing folio from
the XArray? I'm beginning to suspect it's a mishandled failure in
split_huge_page(), so I'll re-review that code path tomorrow.
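
For anyone following along, the failure shape I'm looking for can be
modelled in userspace like this. It is a sketch under assumptions:
the fixed-size slots[] array stands in for the XArray, all toy_* names
are hypothetical, and toy_buggy_detach() is not a claim about what
split_huge_page() actually does, only the kind of path that would
produce the symptom.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 4

struct toy_mapping;

/* Hypothetical stand-in for struct folio. */
struct toy_folio {
	struct toy_mapping *mapping;
};

/* Hypothetical stand-in for an address_space; slots[] plays the XArray. */
struct toy_mapping {
	struct toy_folio *slots[TOY_SLOTS];
};

/* Correct removal: drop the slot, then clear ->mapping. */
static void toy_cache_delete(struct toy_mapping *m, size_t index)
{
	struct toy_folio *folio = m->slots[index];

	assert(folio != NULL && folio->mapping == m);
	m->slots[index] = NULL;
	folio->mapping = NULL;
}

/* Suspected bug shape: ->mapping is cleared but the slot is left behind. */
static void toy_buggy_detach(struct toy_folio *folio)
{
	folio->mapping = NULL;
}

/* Count slots whose folio no longer points back at this mapping. */
static size_t toy_count_stale(const struct toy_mapping *m)
{
	size_t i, stale = 0;

	for (i = 0; i < TOY_SLOTS; i++)
		if (m->slots[i] != NULL && m->slots[i]->mapping != m)
			stale++;
	return stale;
}
```

After toy_buggy_detach() the folio passes the free-time ->mapping check,
yet the slot still points at it. Once that memory is reused (by zram,
say), a lookup finds a "folio" whose ->mapping is whatever the new owner
stored there.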