Re: Kernel NULL pointer deref and data corruptions with xfs on 6.1

From: Daniel Dao
Date: Mon Jul 24 2023 - 18:04:32 EST


On Mon, Jul 24, 2023 at 10:45 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jul 24, 2023 at 12:23:31PM +0100, Daniel Dao wrote:
> > Hi again,
> >
> > We had another example of xarray corruption involving xfs and zsmalloc. We are
> > running zram as swap. We have 2 tasks deadlock waiting for page to be released
>
> Do your problems on 6.1 go away if you stop using zram as swap?

We had xarray corruptions even on nodes without swap, so I'm not sure
swap matters. The corruption on those nodes was noted in the first
email, with the following trace:

BUG: kernel NULL pointer dereference, address: 0000000000000036
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 18806c5067 P4D 18806c5067 PUD 188ed48067 PMD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 73 PID: 3579408 Comm: prometheus Tainted: G O 6.1.34-cloudflare-2023.6.7 #1
Hardware name: GIGABYTE R162-Z12-CD1/MZ12-HD4-CD, BIOS M03 11/19/2021
RIP: 0010:__filemap_get_folio (arch/x86/include/asm/atomic.h:29
include/linux/atomic/atomic-arch-fallback.h:1242
include/linux/atomic/atomic-arch-fallback.h:1267
include/linux/atomic/atomic-instrumented.h:608
include/linux/page_ref.h:238 include/linux/page_ref.h:247
include/linux/page_ref.h:280 include/linux/page_ref.h:313
mm/filemap.c:1863 mm/filemap.c:1915)

It's hard for us to run tests without zram swap at scale, since its
benefits are significant for a lot of workloads.

Daniel.