Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

From: Blazej Kucman
Date: Wed Jan 31 2024 - 08:37:06 EST


On Tue, 30 Jan 2024 20:55:39 -0800
Song Liu <song@xxxxxxxxxx> wrote:

> On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >
> > Can you test the following patch?
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index e3a56a958b47..a8db84c200fe 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
> >  			rcu_read_lock();
> >  		}
> >  	rcu_read_unlock();
> > -	if (atomic_dec_and_test(&mddev->flush_pending))
> > +	if (atomic_dec_and_test(&mddev->flush_pending)) {
> > +		/* The pair is percpu_ref_get() from md_flush_request() */
> > +		percpu_ref_put(&mddev->active_io);
> > +
> >  		queue_work(md_wq, &mddev->flush_work);
> > +	}
> >  }
> >
> > static void md_submit_flush_data(struct work_struct *ws)
>
> This fixes the issue in my tests. Please submit the official patch.
> Also, we should add a test in mdadm/tests to cover this case.
>
> Thanks,
> Song
>

Hi Kuai,

On my hardware, the issue also stopped reproducing with this fix.

I applied the fix on top of the current HEAD of the master
branch in the kernel/git/torvalds/linux.git repo.

Thanks,
Blazej