two reports about NULL pointer dereferences in mm subsystem since 5.18(?) with qbittorrent on XFS

From: Linux regression tracking (Thorsten Leemhuis)
Date: Wed Jul 26 2023 - 08:43:44 EST


Hi everyone! There are two regression reports with somewhat similar
symptoms in bugzilla.kernel.org that seem to not get the attention they
IMHO deserve. Both reports describe oopses due to a NULL pointer
dereference that seems to happen in the mm subsystem when using
qbittorrent on XFS. It seems quite a few users of that combination run
into problems.

One report initially received replies from Akpm and Willy, but at some
point the regression apparently fell through the cracks (I prodded a few
times, but was ignored). The other never got a reply from a mm developer
afaics. I hope to revive things with this mail to finally get them on
track and resolved.

What follows is a *very rough* overview about the two reports which
might be caused by the same or different bugs in the code. Both
descriptions will leave out lots of details on purpose, as it can easily
happen that I understood something wrong; sorry, this can happen in my
position, so handle the following with care.


The older of the two reports is
https://bugzilla.kernel.org/show_bug.cgi?id=216646 (from November last
year); the report already mentioned that disabling THP avoids the oopses
due to a NULL pointer dereference. The problems was actually bisected to
793917d997d ("mm/readahead: Add large folio readahead") [v5.18-rc1] at
one point (https://bugzilla.kernel.org/show_bug.cgi?id=216646#c6 ). The
reporter was able to avoid the problem by switching from XFS to Btrfs
(https://bugzilla.kernel.org/show_bug.cgi?id=216646#c18 ). Someone else
not that long ago in the ticket reported that 66dabbb65d6 ("mm: return
an ERR_PTR from __filemap_get_folio") [v6.4-rc1] is a partial fix (side
note: I wonder if we should backport that to 6.1.y). The problem is also
discussed in https://github.com/arvidn/libtorrent/issues/6952, as it
seems quite a few other users of libtorrent based software encountered
it. I recently asked in that ticket to test if 6.5-rc is still affected,
but got no reply so far.


The other report is https://bugzilla.kernel.org/show_bug.cgi?id=217441;
the reporter never managed a bisection. Disabling THP apparently does
not help for tha reporter, so it might be something totally different. I
recently asked if the problems is still happening with recent kernels.
Two users confirmed it does, it just takes a lot longer to trigger (if
that is due to 66dabbb65d6 or not is not known). One of those users
tested with linux-next and provided this:

> Jul 20 19:03:24 smoon7.bkoty.ru kernel: BUG: kernel NULL pointer dereference, address: 0000000000000096
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: #PF: supervisor read access in kernel mode
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: #PF: error_code(0x0000) - not-present page
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: PGD 0 P4D 0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: CPU: 4 PID: 305164 Comm: qbittorrent-nox Tainted: G U 6.5.0-rc2-next-20230718-1-next-git-03113-gaeba456828b4 #1 8d98cf92e1199e734fba1ef76a33030687665b92
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: Hardware name: Gigabyte Technology Co., Ltd. H470M DS3H/H470M DS3H, BIOS F4b 06/22/2020
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RIP: 0010:filemap_get_entry+0x8a/0x130
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: Code: 24 18 03 00 00 00 48 89 e7 e8 32 54 a6 00 48 89 c3 48 3d 02 04 00 00 74 e4 48 3d 06 04 00 00 74 dc 48 85 c0 74 4f a8 01 75 4b <8b> 40 34 85 c0 74 cc 8d 50 01 f0 0f b1 53 34 75 f2 48 8b 54 24 18
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RSP: 0000:ffffb7d6c4e1bc70 EFLAGS: 00010246
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RAX: 0000000000000062 RBX: 0000000000000062 RCX: 0000000000000002
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RDX: 000000000000001c RSI: ffff996fdd57c490 RDI: ffffb7d6c4e1bc70
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RBP: 0000000000000000 R08: ffffffffffffffc0 R09: 0000000000000000
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: R10: ffff99695e549d50 R11: ffff99695e549d0c R12: ffff99699ffe2fc0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: R13: 000000000001eb9d R14: 0000000000000000 R15: ffff9969a06cfef8
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: FS: 00007f9077fff6c0(0000) GS:ffff9970de300000(0000) knlGS:0000000000000000
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: CR2: 0000000000000096 CR3: 0000000344754004 CR4: 00000000003706e0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: Call Trace:
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: <TASK>
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? __die+0x23/0x70
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? page_fault_oops+0x171/0x4e0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? psi_group_change+0x213/0x3c0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? exc_page_fault+0x7f/0x180
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? asm_exc_page_fault+0x26/0x30
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? filemap_get_entry+0x8a/0x130
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: __filemap_get_folio+0x2b/0x230
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: filemap_fault+0x6b/0x9f0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: ? filemap_map_pages+0x2dc/0x560
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: __do_fault+0x30/0x130
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: do_fault+0x26c/0x430
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: __handle_mm_fault+0x73f/0xbb0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: handle_mm_fault+0x17f/0x360
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: do_user_addr_fault+0x1e6/0x640
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: exc_page_fault+0x7f/0x180
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: asm_exc_page_fault+0x26/0x30
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RIP: 0033:0x7f909336cb0d
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: Code: 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 83 fa 20 72 23 <c5> fe 6f 06 48 83 fa 40 0f 87 a5 00 00 00 c5 fe 6f 4c 16 e0 c5 fe
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RSP: 002b:00007f9077ffd298 EFLAGS: 00010202
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RAX: 00007f9050012770 RBX: 00007f9077ffe300 RCX: 00007f9077ffd4c0
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RDX: 0000000000004000 RSI: 00007f77e759d10f RDI: 00007f9050012770
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> Jul 20 19:03:24 smoon7.bkoty.ru kernel: R13: 00007f9050000bf0 R14: 0000000000000007 R15: 000000001eb9d10f
> [...]

But we apparently have at least two people that still care about this
problem. Could anyone maybe help them somewhat so we can with a bit of
luck maybe finally get down to the actually cause and fix it?


That's it from my side. Really hope this will get things moving again.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.