Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2

From: Xiao Ni
Date: Sun Mar 03 2024 - 20:26:26 EST


On Mon, Mar 4, 2024 at 9:24 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2024/03/04 9:07, Yu Kuai wrote:
> > Hi,
> >
> > On 2024/03/03 21:16, Xiao Ni wrote:
> >> Hi all
> >>
> >> There is an error report from the lvm regression tests. The failing
> >> case is lvconvert-raid-reshape-stripes-load-reload.sh. I also saw this
> >> error when I tried to fix the dmraid regression problems. In my patch
> >> set, after reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't
> >> register sync_thread for reshape directly), this problem doesn't appear.
> >

Hi Kuai
> > How often did you see this test fail? I've been running the tests for
> > over two days now, for 30+ rounds, and this test has never failed in my VM.

I ran 5 times and it failed 2 times just now.
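
For anyone who wants to measure the failure rate the same way, a loop
like the one below is one option. It is only a sketch: count_failures is
a hypothetical helper, not part of lvm2, and it assumes the command being
run exits non-zero when the test fails.

```shell
#!/bin/sh
# count_failures ROUNDS CMD [ARGS...]
# Hypothetical helper (not part of lvm2): run CMD ROUNDS times and
# print how many of those runs exited with a non-zero status.
count_failures() {
    rounds=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Discard the command's output; only the exit status matters here.
        "$@" > /dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "$failed"
}
```

From the top of an lvm2 tree it could be called as
count_failures 5 make check T=shell/lvconvert-raid-reshape-stripes-load-reload.sh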

>
> After a quick look, there is still a path in raid10 through which
> MD_RECOVERY_FROZEN can be cleared, so in theory this problem can be
> triggered. Can you test the following patch on top of this set?
> I'll keep running the test myself.

Sure, I'll give the result later.

Regards
Xiao
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a5f8419e2df1..7ca29469123a 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
>  	return 0;
>
>  abort:
> -	mddev->recovery = 0;
> +	if (mddev->gendisk)
> +		mddev->recovery = 0;
>  	spin_lock_irq(&conf->device_lock);
>  	conf->geo = conf->prev;
>  	mddev->raid_disks = conf->geo.raid_disks;
>
> Thanks,
> Kuai
> >
> > Thanks,
> > Kuai
> >
> >>
> >> I put the log in the attachment.
> >>
> >> On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>>
> >>> From: Yu Kuai <yukuai3@xxxxxxxxxx>
> >>>
> >>> link to part1:
> >>> https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@xxxxxxxxxxxxxx/
> >>>
> >>>
> >>> part1 contains fixes for deadlocks when stopping sync_thread.
> >>>
> >>> This set contains fixes:
> >>> - reshape can start unexpectedly and cause data corruption, patches 1, 5, 6;
> >>> - deadlocks when reshape runs concurrently with IO, patch 8;
> >>> - a lockdep warning, patch 9;
> >>>
> >>> I've been running the lvm2 tests with the following script for a few rounds now:
> >>>
> >>> for t in `ls test/shell`; do
> >>>         if cat test/shell/$t | grep raid &> /dev/null; then
> >>>                 make check T=shell/$t
> >>>         fi
> >>> done
> >>>
> >>> There are no deadlocks and no fs corruption now; however, there are
> >>> still four failed tests:
> >>>
> >>> ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
> >>> ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> >>> ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> >>> ### failed: [ndev-vanilla] shell/lvextend-raid.sh
> >>>
> >>> And the failure reason is the same for all of them:
> >>>
> >>> ## ERROR: The test started dmeventd (147856) unexpectedly
> >>>
> >>> I have no clue yet, and it seems other folks don't have this issue.
> >>>
> >>> Yu Kuai (9):
> >>> md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
> >>> md: export helpers to stop sync_thread
> >>> md: export helper md_is_rdwr()
> >>> md: add a new helper reshape_interrupted()
> >>> dm-raid: really frozen sync_thread during suspend
> >>> md/dm-raid: don't call md_reap_sync_thread() directly
> >>> dm-raid: add a new helper prepare_suspend() in md_personality
> >>> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
> >>> concurrent with reshape
> >>> dm-raid: fix lockdep waring in "pers->hot_add_disk"
> >>>
> >>> drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
> >>> drivers/md/md.c | 73 ++++++++++++++++++++++++++--------
> >>> drivers/md/md.h | 38 +++++++++++++++++-
> >>> drivers/md/raid5.c | 32 ++++++++++++++-
> >>> 4 files changed, 196 insertions(+), 40 deletions(-)
> >>>
> >>> --
> >>> 2.39.2
> >>>
> >
> >
> > .
> >
>