Re: [dm-devel] [PATCH -next v2 4/6] md: refactor idle/frozen_sync_thread() to fix deadlock

From: Yu Kuai
Date: Tue Jun 13 2023 - 21:49:02 EST


Hi,

On 2023/06/13 22:50, Xiao Ni wrote:

On 2023/5/29 9:20 PM, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>

Our test found the following deadlock in raid10:

1) Issue a normal write, and the write fails:

   raid10_end_write_request
    set_bit(R10BIO_WriteError, &r10_bio->state)
    one_write_done
     reschedule_retry

   // later from md thread
   raid10d
    handle_write_completed
     list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

   // later from md thread
   raid10d
    if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
     list_move(conf->bio_end_io_list.prev, &tmp)
     r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
     raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for the superblock update
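
For context, the gate in raid10d() that defers these failed writes is
roughly the following (simplified from drivers/md/raid10.c, not
verbatim):

    /*
     * Simplified: failed writes parked on bio_end_io_list are only
     * completed once the superblock update is no longer pending.
     */
    if (!list_empty_careful(&conf->bio_end_io_list) &&
        !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
            LIST_HEAD(tmp);

            spin_lock_irqsave(&conf->device_lock, flags);
            if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
                    list_splice_init(&conf->bio_end_io_list, &tmp);
            spin_unlock_irqrestore(&conf->device_lock, flags);

            while (!list_empty(&tmp)) {
                    r10_bio = list_first_entry(&tmp, struct r10bio,
                                               retry_list);
                    list_del(&r10_bio->retry_list);
                    raid_end_bio_io(r10_bio);
            }
    }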

Hi Kuai

It looks like the situation above is more complex than necessary. It
only needs a normal write, and md_write_start needs to wait until the
metadata is written to the member disks, right? If so, there is no need
to introduce a raid10 write failure here. I guess that is what your
test case does. It would be nice if you could put your test steps in
the patch. But for the analysis of the deadlock here, it's better to
keep it simple.

The test script can be found here; it's pretty easy to trigger:

https://patchwork.kernel.org/project/linux-raid/patch/20230529132826.2125392-4-yukuai1@xxxxxxxxxxxxxxx/

While reviewing the related code, I found that io can only be added to
the list bio_end_io_list from handle_write_completed() if such io has
failed, so I think io failure is needed to trigger the deadlock from
the daemon thread.

I think the key point is how MD_SB_CHANGE_PENDING is set:

1) raid10_error() and rdev_set_badblocks(), triggered by io failure;
2) raid10_write_request(), related to reshape;
3) md_write_start() and md_allow_write(), when mddev->in_sync is set;
however, I don't think this is a common case.

1) is used here because it's quite easy to trigger, and it's what we
met in the real test. 3) is possible, but I'd say let's keep 1); I
don't think it's necessary to reproduce this deadlock through another
path again.
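
For reference, the wait in case 3) is roughly the following (simplified
from md_write_start() in drivers/md/md.c, not verbatim):

    /*
     * Simplified: the first write to an 'in_sync' array marks the
     * superblock dirty, then blocks until the md thread writes it out.
     */
    if (mddev->in_sync) {
            mddev->in_sync = 0;
            set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
            set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
            md_wakeup_thread(mddev->thread);
    }
    ...
    wait_event(mddev->sb_wait,
               !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));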

Thanks,
Kuai


2) Trigger a recovery:

   raid10_sync_request
    raise_barrier

Dependency chain 2: sync thread is waiting for normal io
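
The wait happens in raise_barrier(), roughly as follows (simplified
from drivers/md/raid10.c, locking details omitted):

    static void raise_barrier(struct r10conf *conf, int force)
    {
            /* wait until no normal io is waiting for the barrier */
            wait_event(..., force || !conf->nr_waiting);
            /* block any new io from starting */
            conf->barrier++;
            /* now wait for all pending normal io to complete */
            wait_event(..., !atomic_read(&conf->nr_pending) &&
                            conf->barrier < RESYNC_DEPTH);
    }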

3) echo idle/frozen to sync_action:

   action_store
    mddev_lock
     md_unregister_thread
      kthread_stop

Dependency chain 3: dropping 'reconfig_mutex' is waiting for the sync thread
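
With the code before this patch (the '-' lines in stop_sync_thread()
further below), the path is roughly:

    /*
     * Sketch of the old path, error handling omitted: the sync thread
     * is reaped while 'reconfig_mutex' is held, so kthread_stop() ends
     * up waiting with the lock taken.
     */
    mddev_lock(mddev);                  /* takes 'reconfig_mutex' */
    if (mddev->sync_thread) {
            set_bit(MD_RECOVERY_INTR, &mddev->recovery);
            md_reap_sync_thread(mddev); /* -> md_unregister_thread()
                                           -> kthread_stop() */
    }
    mddev_unlock(mddev);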

4) md thread can't update superblock:

   raid10d
    md_check_recovery
     if (mddev_trylock(mddev))
      md_update_sb

Dependency chain 4: updating the superblock is waiting for 'reconfig_mutex'
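
The superblock is only written with 'reconfig_mutex' held; simplified
from md_check_recovery() in drivers/md/md.c, not verbatim:

    /*
     * Simplified: the md thread only updates the superblock if it can
     * take 'reconfig_mutex'. While action_store() holds the mutex,
     * MD_SB_CHANGE_PENDING stays set and chain 1 stays blocked.
     */
    if (mddev_trylock(mddev)) {
            ...
            if (mddev->sb_flags)
                    md_update_sb(mddev, 0);
            ...
            mddev_unlock(mddev);
    }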

Hence a cyclic dependency exists, and in order to fix the problem we
must break one of the links. Dependencies 1 and 2 can't be broken
because they are fundamental to the design. Breaking dependency 4
might be possible if it could be guaranteed that no io is inflight,
but that requires a new mechanism and seems complex. Dependency 3 is
a good choice, because idle/frozen only requires the sync thread to
finish, which can already be done asynchronously, and 'reconfig_mutex'
is then no longer needed.

This patch switches 'idle' and 'frozen' to waiting for the sync thread
to finish asynchronously, and it also adds a sequence counter to record
how many times the sync thread has finished, so that 'idle' won't keep
waiting on a newly started sync thread.

In the patch, sync_seq is incremented in md_reap_sync_thread. In
idle_sync_thread, if sync_seq isn't equal to mddev->sync_seq, it should
mean that someone has already stopped the sync thread, right? Why do
you say 'newly started sync thread' here?

Regards

Xiao



Note that raid456 has a similar deadlock [1], and it has been
verified [2] that this patch fixes that deadlock as well.

[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@xxxxxxxxxxxxx/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@xxxxxxxxxxxxxxx/
Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
  drivers/md/md.c | 23 +++++++++++++++++++----
  drivers/md/md.h |  2 ++
  2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 63a993b52cd7..7912de0e4d12 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -652,6 +652,7 @@ void mddev_init(struct mddev *mddev)
      timer_setup(&mddev->safemode_timer, md_safemode_timeout, 0);
      atomic_set(&mddev->active, 1);
      atomic_set(&mddev->openers, 0);
+    atomic_set(&mddev->sync_seq, 0);
      spin_lock_init(&mddev->lock);
      atomic_set(&mddev->flush_pending, 0);
      init_waitqueue_head(&mddev->sb_wait);
@@ -4776,19 +4777,27 @@ static void stop_sync_thread(struct mddev *mddev)
      if (work_pending(&mddev->del_work))
          flush_workqueue(md_misc_wq);
-    if (mddev->sync_thread) {
-        set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-        md_reap_sync_thread(mddev);
-    }
+    set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+    /*
+     * Thread might be blocked waiting for metadata update which will now
+     * never happen
+     */
+    md_wakeup_thread_directly(mddev->sync_thread);
      mddev_unlock(mddev);
  }
  static void idle_sync_thread(struct mddev *mddev)
  {
+    int sync_seq = atomic_read(&mddev->sync_seq);
+
      mutex_lock(&mddev->sync_mutex);
      clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
      stop_sync_thread(mddev);
+
+    wait_event(resync_wait, sync_seq != atomic_read(&mddev->sync_seq) ||
+            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
      mutex_unlock(&mddev->sync_mutex);
  }
@@ -4797,6 +4806,10 @@ static void frozen_sync_thread(struct mddev *mddev)
    mutex_lock(&mddev->sync_mutex);
      set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
      stop_sync_thread(mddev);
+
+    wait_event(resync_wait, mddev->sync_thread == NULL &&
+            !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
      mutex_unlock(&mddev->sync_mutex);
  }
@@ -9472,6 +9485,8 @@ void md_reap_sync_thread(struct mddev *mddev)
      /* resync has finished, collect result */
      md_unregister_thread(&mddev->sync_thread);
+    atomic_inc(&mddev->sync_seq);
+
      if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
          !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
          mddev->degraded != mddev->raid_disks) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2fa903de5bd0..7cab9c7c45b8 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -539,6 +539,8 @@ struct mddev {
      /* Used to synchronize idle and frozen for action_store() */
      struct mutex            sync_mutex;
+    /* The sequence number for sync thread */
+    atomic_t sync_seq;
      bool    has_superblocks:1;
      bool    fail_last_dev:1;
