[regression, stable] Re: Bug 215562 - BUG: unable to handle page fault in cache_reap (fwd from bugzilla)

From: Thorsten Leemhuis
Date: Wed Feb 16 2022 - 03:44:21 EST


Hi, this is your Linux kernel regression tracker speaking. Top-posting
for once, to make this easily accessible to everyone.

The issue below, which started to happen somewhere between v5.10.80 and
v5.10.90, was recently reported to bugzilla, but afaics the reporter
didn't even get a single reply. Could somebody maybe take a look?
Bisection is likely not easy in this case, so a few tips to narrow down
the area to search might help a lot here; some rough pointers from my
side follow right below the link.

https://bugzilla.kernel.org/show_bug.cgi?id=215562
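
A few observations that might help narrow down where to look, with the
usual caveat that I only glanced at the oops and might be decoding it
wrong: the faulting address ffffebde00000008 looks like a read from a
struct page computed relative to the vmemmap base (note RBX =
ffffea0000000000 and RAX = ffffebde00000000 in the register dump), i.e.
free_block() apparently derived a struct page pointer from a garbage
object address in a per-CPU array cache and faulted on the first read
from it. Roughly, and heavily simplified -- this is just a sketch for
orientation, not the literal mm/slab.c code:

  static void free_block(struct kmem_cache *cachep, void **objpp,
                         int nr_objects, int node, struct list_head *list)
  {
          int i;

          for (i = 0; i < nr_objects; i++) {
                  void *objp = objpp[i]; /* from a per-CPU array cache */
                  struct page *page;

                  /*
                   * Translates the object address into its struct page
                   * via vmemmap arithmetic; with a bogus objp the
                   * result points outside the valid memory map, and
                   * the first dereference (the compound_head check, an
                   * 8-byte read at offset 8) faults -- which would
                   * match the CR2 value in the oops quoted below.
                   */
                  page = virt_to_head_page(objp);

                  /* ... give the object back to its slab ... */
          }
  }

So if that reading is right, the object pointer was already corrupted
before cache_reap() ever ran, which is worth keeping in mind when
picking suspects.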

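If a full bisection of v5.10.80..v5.10.90 is impractical, maybe start
with the slab-related changes in that range, or restrict the bisection
to mm/; untested, but something like:

  git log --oneline v5.10.80..v5.10.90 -- mm/slab.c mm/slab_common.c
  git bisect start v5.10.90 v5.10.80 -- mm/
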
Ciao, Thorsten


On 03.02.22 16:03, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
>
> There is a regression reported on bugzilla.kernel.org I'd like to add
> to the tracking:
>
> #regzbot introduced: v5.10.80..v5.10.90
> #regzbot from: Patrick Schaaf <kernelorg@xxxxxx>
> #regzbot title: mm: unable to handle page fault in cache_reap
> #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215562
>
> Quote:
>
>> We've been running self-built 5.10.x kernels on DL380 hosts for quite a while, also inside the VMs there.
>>
>> With I think 5.10.90, three weeks or so back, we experienced a lockup upon umounting a larger, dirty filesystem on the host side; unfortunately we didn't capture a backtrace back then.
>>
>> Today something feeling similar happened again, on a machine running 5.10.93 both on the host and inside its 10 various VMs.
>>
>> The problem showed up shortly (minutes) after shutting down one of the VMs (a few hundred GB of memory/dataset; the VM shutdown was already complete; direct I/O), followed by some LVM volume renames and a quick, short ext4 mount outside the VMs followed by an umount (8 GB volume, probably only a few hundred megabytes to write). Monitoring actually suggests that disk writes were already done about a minute before the onset.
>>
>> What we then experienced was the following BUG:, followed by one CPU after the other saying goodbye with soft lockup messages over the course of a few minutes; meanwhile there was no more pinging the box, logging in on the console, etc. We hard-powercycled and it recovered fully.
>>
>> Here's the BUG that was logged; if it would be useful for someone to see the follow-up soft lockup messages, tell me and I'll add them.
>>
>> Feb 02 15:22:27 kvm3j kernel: BUG: unable to handle page fault for address: ffffebde00000008
>> Feb 02 15:22:27 kvm3j kernel: #PF: supervisor read access in kernel mode
>> Feb 02 15:22:27 kvm3j kernel: #PF: error_code(0x0000) - not-present page
>> Feb 02 15:22:27 kvm3j kernel: Oops: 0000 [#1] SMP PTI
>> Feb 02 15:22:27 kvm3j kernel: CPU: 7 PID: 39833 Comm: kworker/7:0 Tainted: G I 5.10.93-kvm #1
>> Feb 02 15:22:27 kvm3j kernel: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
>> Feb 02 15:22:27 kvm3j kernel: Workqueue: events cache_reap
>> Feb 02 15:22:27 kvm3j kernel: RIP: 0010:free_block.constprop.0+0xc0/0x1f0
>> Feb 02 15:22:27 kvm3j kernel: Code: 4c 8b 16 4c 89 d0 48 01 e8 0f 82 32 01 00 00 4c 89 f2 48 bb 00 00 00 00 00 ea ff ff 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 01 d8 <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 >
>> Feb 02 15:22:27 kvm3j kernel: RSP: 0018:ffffc9000252bdc8 EFLAGS: 00010086
>> Feb 02 15:22:27 kvm3j kernel: RAX: ffffebde00000000 RBX: ffffea0000000000 RCX: ffff888889141b00
>> Feb 02 15:22:27 kvm3j kernel: RDX: 0000777f80000000 RSI: ffff893d3edf3400 RDI: ffff8881000403c0
>> Feb 02 15:22:27 kvm3j kernel: RBP: 0000000080000000 R08: ffff888100041300 R09: 0000000000000003
>> Feb 02 15:22:27 kvm3j kernel: R10: 0000000000000000 R11: ffff888100041308 R12: dead000000000122
>> Feb 02 15:22:27 kvm3j kernel: R13: dead000000000100 R14: 0000777f80000000 R15: ffff893ed8780d60
>> Feb 02 15:22:27 kvm3j kernel: FS: 0000000000000000(0000) GS:ffff893d3edc0000(0000) knlGS:0000000000000000
>> Feb 02 15:22:27 kvm3j kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008 CR3: 000000048c4aa002 CR4: 00000000001726e0
>> Feb 02 15:22:27 kvm3j kernel: Call Trace:
>> Feb 02 15:22:27 kvm3j kernel: drain_array_locked.constprop.0+0x2e/0x80
>> Feb 02 15:22:27 kvm3j kernel: drain_array.constprop.0+0x54/0x70
>> Feb 02 15:22:27 kvm3j kernel: cache_reap+0x6c/0x100
>> Feb 02 15:22:27 kvm3j kernel: process_one_work+0x1cf/0x360
>> Feb 02 15:22:27 kvm3j kernel: worker_thread+0x45/0x3a0
>> Feb 02 15:22:27 kvm3j kernel: ? process_one_work+0x360/0x360
>> Feb 02 15:22:27 kvm3j kernel: kthread+0x116/0x130
>> Feb 02 15:22:27 kvm3j kernel: ? kthread_create_worker_on_cpu+0x40/0x40
>> Feb 02 15:22:27 kvm3j kernel: ret_from_fork+0x22/0x30
>> Feb 02 15:22:27 kvm3j kernel: Modules linked in: hpilo
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008
>> Feb 02 15:22:27 kvm3j kernel: ---[ end trace ded3153d86a92898 ]---
>> Feb 02 15:22:27 kvm3j kernel: RIP: 0010:free_block.constprop.0+0xc0/0x1f0
>> Feb 02 15:22:27 kvm3j kernel: Code: 4c 8b 16 4c 89 d0 48 01 e8 0f 82 32 01 00 00 4c 89 f2 48 bb 00 00 00 00 00 ea ff ff 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 01 d8 <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 >
>> Feb 02 15:22:27 kvm3j kernel: RSP: 0018:ffffc9000252bdc8 EFLAGS: 00010086
>> Feb 02 15:22:27 kvm3j kernel: RAX: ffffebde00000000 RBX: ffffea0000000000 RCX: ffff888889141b00
>> Feb 02 15:22:27 kvm3j kernel: RDX: 0000777f80000000 RSI: ffff893d3edf3400 RDI: ffff8881000403c0
>> Feb 02 15:22:27 kvm3j kernel: RBP: 0000000080000000 R08: ffff888100041300 R09: 0000000000000003
>> Feb 02 15:22:27 kvm3j kernel: R10: 0000000000000000 R11: ffff888100041308 R12: dead000000000122
>> Feb 02 15:22:27 kvm3j kernel: R13: dead000000000100 R14: 0000777f80000000 R15: ffff893ed8780d60
>> Feb 02 15:22:27 kvm3j kernel: FS: 0000000000000000(0000) GS:ffff893d3edc0000(0000) knlGS:0000000000000000
>> Feb 02 15:22:27 kvm3j kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Feb 02 15:22:27 kvm3j kernel: CR2: ffffebde00000008 CR3: 000000048c4aa002 CR4: 00000000001726e0
>
> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)
>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my desk. I can only look briefly into most of them; unfortunately
> that means I sometimes get things wrong or miss something important. I
> hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply, that's in everyone's interest.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CCed
> on all further activities wrt this regression.
>
> ---
> Additional information about regzbot:
>
> If you want to know more about regzbot, check out its web-interface, the
> getting started guide, and/or the reference documentation:
>
> https://linux-regtracking.leemhuis.info/regzbot/
> https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
> https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
>
> The last two documents explain how you can interact with regzbot
> yourself if you want to.
>
> Hint for reporters: when reporting a regression, it's in your interest
> to tell #regzbot about it in the report, as that will ensure the
> regression gets on the radar of regzbot and the regression tracker, who
> will make sure the report won't fall through the cracks unnoticed.
>
> Hint for developers: you normally don't need to care about regzbot once
> it's involved. Fix the issue as you normally would, just remember to
> include a 'Link:' tag to the report in the commit message, as explained
> in Documentation/process/submitting-patches.rst
> That aspect was recently made more explicit in commit 1f57bd42b77c:
> https://git.kernel.org/linus/1f57bd42b77c
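>
> For this particular regression, that would be:
>
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=215562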