Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

From: Dan Moulding
Date: Sat Mar 02 2024 - 11:55:53 EST


> I have not root-caused this yet, but would like to share some findings from
> the vmcore Dan shared. From what I can see, this doesn't look like an md
> issue, but something wrong with the block layer or below.

Below is one more thing I found that might be of interest. It is from
the email thread [1] that was linked from the original 2022 issue,
the one the change in question reverts:

On 2022-09-02 17:46, Logan Gunthorpe wrote:
> I've made some progress on this nasty bug. I've got far enough to know it's not
> related to the blk-wbt or the block layer.
>
> Turns out a bunch of bios are stuck queued in a blk_plug in the md_raid5
> thread while that thread appears to be stuck in an infinite loop (so it never
> schedules or does anything to flush the plug).
>
> I'm still debugging to try and find out the root cause of that infinite loop,
> but I just wanted to send an update that the previous place I was stuck at
> was not correct.
>
> Logan

This certainly sounds like it has some similarities to what we are
seeing when that change is reverted. The md0_raid5 thread appears to be
in an infinite loop, consuming 100% CPU, but not actually doing any
work.

-- Dan

[1] https://lore.kernel.org/r/7f3b87b6-b52a-f737-51d7-a4eec5c44112@xxxxxxxxxxxx