Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

From: Guoqing Jiang
Date: Tue Feb 02 2021 - 10:47:26 EST

Next message: Sasha Levin: "[PATCH AUTOSEL 4.19 03/10] chtls: Fix potential resource leak"
Previous message: Catalin Marinas: "Re: [PATCH 12/12] arm64: kasan: export MTE symbols for KASAN tests"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Donald,

On 1/26/21 17:05, Donald Buczek wrote:

Dear Guoqing,

On 26.01.21 15:06, Guoqing Jiang wrote:

On 1/26/21 13:58, Donald Buczek wrote:

Hmm, how about wake the waiter up in the while loop of raid5d?

@@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)
                         md_check_recovery(mddev);
                         spin_lock_irq(&conf->device_lock);
                 }
+
+               if ((atomic_read(&conf->active_stripes)
+                    < (conf->max_nr_stripes * 3 / 4) ||
+                    (test_bit(MD_RECOVERY_INTR, &mddev->recovery))))
+                       wake_up(&conf->wait_for_stripe);
         }
         pr_debug("%d stripes handled\n", handled);

Hmm... With this patch on top of your other one, we still have the basic symptoms (md3_raid6 busy looping), but the sync thread is now hanging at

     root@sloth:~# cat /proc/$(pgrep md3_resync)/stack
     [<0>] md_do_sync.cold+0x8ec/0x97c
     [<0>] md_thread+0xab/0x160
     [<0>] kthread+0x11b/0x140
     [<0>] ret_from_fork+0x22/0x30

instead, which is https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L8963

Not sure why recovery_active is not zero, because it is set 0 before blk_start_plug, and raid5_sync_request returns 0 and skipped is also set to 1. Perhaps handle_stripe calls md_done_sync.

Could you double check the value of recovery_active? Or just don't wait if resync thread is interrupted.

wait_event(mddev->recovery_wait,
        test_bit(MD_RECOVERY_INTR,&mddev->recovery) ||
        !atomic_read(&mddev->recovery_active));

With that added, md3_resync goes into a loop, too. Not 100% busy, though.

root@sloth:~# cat /proc/$(pgrep md3_resync)/stack

[<0>] raid5_get_active_stripe+0x1e7/0x6b0 # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L735
[<0>] raid5_sync_request+0x2a7/0x3d0       # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/raid5.c#L6274
[<0>] md_do_sync.cold+0x3ee/0x97c          # https://elixir.bootlin.com/linux/v5.11-rc5/source/drivers/md/md.c#L8883
[<0>] md_thread+0xab/0x160
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

Sometimes top of stack is raid5_get_active_stripe+0x1ef/0x6b0 instead of raid5_get_active_stripe+0x1e7/0x6b0, so I guess it sleeps, its woken, but the conditions don't match so its sleeps again.

I don't know why the condition was not true after the change since the RECOVERY_INTR is set and the caller is raid5_sync_request.

wait_event_lock_irq(conf->wait_for_stripe,
(test_bit(MD_RECOVERY_INTR, &mddev->recovery) && sync_req) ||
/* the previous condition */,
*(conf->hash_locks + hash));

BTW, I think there some some possible ways:

1. let "echo idle" give up the reconfig_mutex if there are limited number of active stripe, but I feel it is ugly to check sh number from action_store (kind of layer violation).

2. make raid5_sync_request -> raid5_get_active_stripe can quit from the current situation (this was we tried, though it doesn't work so far).

3. set MD_ALLOW_SB_UPDATE as you said though I am not sure the safety (but maybe I am wrong).

4. given the write IO keeps coming from upper layer which decrease the available stripes. Maybe we need to call grow_one_stripe at the beginning of raid5_make_request for this case, then call drop_one_stripe
at the end of make_request.

5. maybe don't hold reconfig_mutex when try to unregister sync_thread, like this.

/* resync has finished, collect result */
mddev_unlock(mddev);
md_unregister_thread(&mddev->sync_thread);
mddev_lock(mddev);

My suggestion would be try 2 + 4 together since the reproducer triggers both sync io and write io. Or try 5. Perhaps there is better way which I just can't find.

Thanks,
Guoqing

Next message: Sasha Levin: "[PATCH AUTOSEL 4.19 03/10] chtls: Fix potential resource leak"
Previous message: Catalin Marinas: "Re: [PATCH 12/12] arm64: kasan: export MTE symbols for KASAN tests"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]