Fwd: __filemap_get_folio and NULL pointer dereference

From: Bagas Sanjaya
Date: Mon May 15 2023 - 22:45:10 EST


Hi,

I noticed a regression report on Bugzilla [1]. Quoting from it:

> Hello.
>
> (I apologize if I chose the wrong "Product" and "Component".)
>
> On two of my systems, I see strange "bug" when running 6+ kernels (below is a recent one):
>
> ```
> May 14 14:48:07 smoon7.bkoty.ru kernel: RIP: 0010:__filemap_get_folio+0xbf/0x6a0
> May 14 14:48:07 smoon7.bkoty.ru kernel: Code: ef e8 c5 60 c3 00 48 89 c7 48 3d 02 04 00 00 74 e4 48 3d 06 04 00 00 74 dc 48 85 c0 0f 84 6a 04 00 00 a8 01 0f 85 6c 04 00 00 <8b> 40 34 85 c0 74 c4 8d 50 01 4c 8d 47 34 f0 0f b1 57 34 75 ee 48
> May 14 14:48:07 smoon7.bkoty.ru kernel: RSP: 0000:ffffa7800b1dfbf8 EFLAGS: 00010246
> May 14 14:48:07 smoon7.bkoty.ru kernel: RAX: 0000000000000002 RBX: 0000000000000000 RCX: 0000000000000004
> May 14 14:48:07 smoon7.bkoty.ru kernel: RDX: ffffa7800b1dfc50 RSI: ffff9a2413646910 RDI: 0000000000000002
> May 14 14:48:07 smoon7.bkoty.ru kernel: RBP: 0000000000000000 R08: ffffffffffffffc0 R09: 00007f862b600000
> May 14 14:48:07 smoon7.bkoty.ru kernel: R10: 00007f8659246f48 R11: ffff9a21c1494a0c R12: 000000000002dc46
> May 14 14:48:07 smoon7.bkoty.ru kernel: R13: ffffa7800b1dfc50 R14: ffff9a21e2cb82b0 R15: 00007f8659246f48
> May 14 14:48:07 smoon7.bkoty.ru kernel: FS: 00007f87fcff96c0(0000) GS:ffff9a295e280000(0000) knlGS:0000000000000000
> May 14 14:48:07 smoon7.bkoty.ru kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> May 14 14:48:07 smoon7.bkoty.ru kernel: CR2: 0000000000000036 CR3: 0000000105b2c003 CR4: 00000000003706e0
> May 14 14:48:07 smoon7.bkoty.ru kernel: Call Trace:
> May 14 14:48:07 smoon7.bkoty.ru kernel: <TASK>
> May 14 14:48:07 smoon7.bkoty.ru kernel: ? psi_group_change+0x274/0x430
> May 14 14:48:07 smoon7.bkoty.ru kernel: filemap_fault+0x6f/0xfd0
> May 14 14:48:07 smoon7.bkoty.ru kernel: ? filemap_map_pages+0x15f/0x640
> May 14 14:48:07 smoon7.bkoty.ru kernel: __do_fault+0x30/0x130
> May 14 14:48:07 smoon7.bkoty.ru kernel: do_fault+0x1d7/0x400
> May 14 14:48:07 smoon7.bkoty.ru kernel: handle_mm_fault+0xb48/0x1450
> May 14 14:48:07 smoon7.bkoty.ru kernel: do_user_addr_fault+0x1c7/0x740
> May 14 14:48:07 smoon7.bkoty.ru kernel: exc_page_fault+0x7c/0x180
> May 14 14:48:07 smoon7.bkoty.ru kernel: asm_exc_page_fault+0x26/0x30
> May 14 14:48:07 smoon7.bkoty.ru kernel: RIP: 0033:0x7f881a56cb0d
> May 14 14:48:07 smoon7.bkoty.ru kernel: Code: 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 83 fa 20 72 23 <c5> fe 6f 06 48 83 fa 40 0f 87 a5 00 00 00 c5 fe 6f 4c 16 e0 c5 fe
> May 14 14:48:07 smoon7.bkoty.ru kernel: RSP: 002b:00007f87fcff72c8 EFLAGS: 00010202
> May 14 14:48:07 smoon7.bkoty.ru kernel: RAX: 00007f87dc02a700 RBX: 00007f87fcff8308 RCX: 00007f87fcff7500
> May 14 14:48:07 smoon7.bkoty.ru kernel: RDX: 0000000000004000 RSI: 00007f8659246f48 RDI: 00007f87dc02a700
> May 14 14:48:07 smoon7.bkoty.ru kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> May 14 14:48:07 smoon7.bkoty.ru kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
> May 14 14:48:07 smoon7.bkoty.ru kernel: R13: 00007f87dc001370 R14: 0000000000000009 R15: 00005645d0719a70
> May 14 14:48:07 smoon7.bkoty.ru kernel: </TASK>
> ```
>
> I've seen these errors since the very first kernel of the 6 series, while I see no problem with 5.15 on the same hardware.
>
> These two systems have the same CPU (Intel(R) Core(TM) i5-10500 CPU @ 3.10GHz) but slightly different motherboards, same amount of memory (same manufacturer, I tested it when plugged in).
>
> The hosts in question don't show this "bug" immediately, but after some time while having "heavy" disk load (torrents). The "bug" shows up whether I use `mitigations=off` or not (at first I thought the "bug" might be related to `mitigations=off`, but I got the above output when I removed that setting from the kernel command line).
>
> What puzzles me is that I don't see these errors on the other hosts (but they don't have "heavy" disk loads), they work just fine. On the other hand, they have different CPUs (not i5-10500). Sometimes (less often than this error) I saw the following in the kernel log (dmesg):
>
> ```
> May 14 08:09:09 smoon7.bkoty.ru kernel: mce: [Hardware Error]: Machine check events logged
> May 14 08:09:09 smoon7.bkoty.ru kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: 9000004000010005
> May 14 08:09:09 smoon7.bkoty.ru kernel: mce: [Hardware Error]: TSC 95596a63008b
> May 14 08:09:09 smoon7.bkoty.ru kernel: mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1684022949 SOCKET 0 APIC 0 microcode f6
> May 14 08:11:39 smoon7.bkoty.ru kernel: mce: [Hardware Error]: Machine check events logged
> May 14 08:11:39 smoon7.bkoty.ru kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 0: 9000004000010005
> May 14 08:11:39 smoon7.bkoty.ru kernel: mce: [Hardware Error]: TSC 95c56b82abf0
> May 14 08:11:39 smoon7.bkoty.ru kernel: mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1684023099 SOCKET 0 APIC a microcode f6
> ```
>
> So now I'm thinking of buying a new CPU (same socket) and see if I will see the same error.

For the full thread, see the Bugzilla report [1].
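
In case it helps whoever picks this up, here is a rough sanity check of the
oops registers, written as a small Python sketch. It assumes the byte marked
`<8b>` in the Code: line starts a plain `mov` with an 8-bit displacement and no
prefix, and the helper name `decode_mov_disp8` is mine, not anything from the
kernel tree. scripts/decodecode is the proper tool for a full disassembly.

```python
# Values copied from the oops quoted above.
RAX = 0x0000000000000002
CR2 = 0x0000000000000036

# Bytes at the faulting RIP: the "<8b> 40 34" marker in the Code: line.
FAULTING_INSN = bytes([0x8B, 0x40, 0x34])


def decode_mov_disp8(insn):
    """Decode `8b /r` (MOV r32, r/m32) in its [reg + disp8] form.

    Hypothetical helper for this one case: it only handles mod == 01
    without a SIB byte and without a REX prefix.
    Returns (reg, rm, disp).
    """
    opcode, modrm, disp = insn[0], insn[1], insn[2]
    if opcode != 0x8B:
        raise ValueError("not a MOV r32, r/m32 opcode")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only the [reg + disp8] form is handled")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return reg, rm, disp


reg, rm, disp = decode_mov_disp8(FAULTING_INSN)

# reg == 0 is EAX (destination), rm == 0 is RAX (base register).
effective_address = RAX + disp

print(f"load from [rax + {disp:#x}] = {effective_address:#x}")
print("matches CR2:", effective_address == CR2)
```

If that reading is right, the fault is a load at offset 0x34 from a base
pointer whose value is 2, rather than from a strictly NULL pointer. That says
nothing yet about where the bogus value came from.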

FYI, filemap_get_folio() was introduced in commit 3f0c6a07fee6a1 ("mm/filemap:
Add filemap_get_folio").
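
Separately, the machine check lines in the report carry the status word
9000004000010005. Below is a minimal sketch that splits it into fields. The
bit positions are my reading of the IA32_MCi_STATUS layout in the Intel SDM,
so treat them as an assumption, and note that the corrected-error count field
is only meaningful on CPUs that report that capability. The `bit` and `bits`
helpers are hypothetical names for this example. mcelog or rasdaemon should be
preferred for an authoritative decode.

```python
# "Bank 0: 9000004000010005" from the mce lines in the quoted report.
STATUS = 0x9000004000010005


def bit(value, n):
    """Return bit n of value."""
    return (value >> n) & 1


def bits(value, hi, lo):
    """Return the inclusive bit range hi..lo of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Assumed field layout (see the lead-in); not taken from the report itself.
fields = {
    "VAL": bit(STATUS, 63),               # status register holds a valid error
    "OVER": bit(STATUS, 62),              # an earlier error was overwritten
    "UC": bit(STATUS, 61),                # error was not corrected
    "EN": bit(STATUS, 60),                # reporting was enabled
    "MISCV": bit(STATUS, 59),             # MISC register is valid
    "ADDRV": bit(STATUS, 58),             # ADDR register is valid
    "PCC": bit(STATUS, 57),               # processor context corrupt
    "corrected_count": bits(STATUS, 52, 38),
    "MSCOD": bits(STATUS, 31, 16),        # model-specific error code
    "MCACOD": bits(STATUS, 15, 0),        # architectural error code
}

for name, value in fields.items():
    print(f"{name:16} {value:#x}")
```

Under those assumptions the event is valid, enabled, and corrected (UC clear),
with an architectural error code of 0x5 and a model-specific code of 0x1. I
have not checked how this relates to the oops, if at all.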

Anyway, I'm adding this to regzbot:

#regzbot introduced: v5.15..v6.0 https://bugzilla.kernel.org/show_bug.cgi?id=217441
#regzbot title: NULL pointer dereference on filemap_get_folio() on Intel Core i5-10500

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217441

--
An old man doll... just what I always wanted! - Clara