Re: [PATCH] block: fix deadlock between bd_link_disk_holder and partition scan

From: Yu Kuai
Date: Sun Feb 18 2024 - 02:47:32 EST


Hi,

On 2024/02/17 3:03, Song Liu wrote:
On Thu, Feb 8, 2024 at 4:49 PM Song Liu <song@xxxxxxxxxx> wrote:

On Thu, Feb 8, 2024 at 12:44 AM Li Nan <linan666@xxxxxxxxxxxxxxx> wrote:



On 2024/2/8 14:50, Song Liu wrote:
On Wed, Feb 7, 2024 at 1:32 AM <linan666@xxxxxxxxxxxxxxx> wrote:

From: Li Nan <linan122@xxxxxxxxxx>

'open_mutex' of gendisk is used to protect the open/close of block devices.
But in bd_link_disk_holder() it is used to protect the creation of the
symlink between the holding disk and the slave bdev, which introduces some
issues.

When bd_link_disk_holder() is called, the driver is usually in the middle of
initialization or modification and may have suspended io submission. At that
point, any io that holds 'open_mutex', such as a partition scan, can cause a
deadlock. For example, in raid:

T1                              T2
bdev_open_by_dev
 lock open_mutex [1]
 ...
 efi_partition
  ...
  md_submit_bio
                                md_ioctl
                                 mddev_suspend
                                  -> suspend all io
                                 md_add_new_disk
                                  bind_rdev_to_array
                                   bd_link_disk_holder
                                    try lock open_mutex [2]
  md_handle_request
   -> wait mddev_resume

T1 scans partitions, T2 adds a new device to the raid. T1 waits for T2 to
resume the mddev, while T2 waits for 'open_mutex' held by T1. Deadlock occurs.

Fix it by introducing a local mutex 'holder_mutex' to replace 'open_mutex'.
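The idea can be modeled in user space with pthreads: once symlink creation
serializes on its own 'holder_mutex', it no longer contends with a partition
scan that holds 'open_mutex', which breaks dependency [2] in the diagram
above. A minimal sketch; all names and the structure are illustrative, not
the actual block-layer code:

```c
#include <assert.h>
#include <pthread.h>

/* User-space model of the fix: the holder-linking path takes a dedicated
 * holder_mutex instead of the disk-wide open_mutex, so it can make
 * progress even while a partition scan holds open_mutex. */
static pthread_mutex_t open_mutex = PTHREAD_MUTEX_INITIALIZER;   /* gendisk open/close */
static pthread_mutex_t holder_mutex = PTHREAD_MUTEX_INITIALIZER; /* new, local to holder code */

/* T2's path: link the holder without touching open_mutex */
int link_disk_holder_model(void)
{
	pthread_mutex_lock(&holder_mutex);
	/* ... create the holder/slave symlinks here ... */
	pthread_mutex_unlock(&holder_mutex);
	return 0;
}
```

With 'open_mutex' held by a scanning thread, link_disk_holder_model() still
completes immediately, so T2 never blocks behind T1's partition scan.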

Is this to fix [1]? Do we need some Fixes and/or Closes tags?


No. [2] is just another way to fix this; both [2] and this patch fix the
issue. I am not sure about the root cause of [1] yet.

[2] https://patchwork.kernel.org/project/linux-raid/list/?series=812045

Could you please add steps to reproduce this issue?

We need to modify the kernel, adding a sleep in md_submit_bio() and in
md_ioctl() as below, and then:
1. mdadm -CR /dev/md0 -l1 -n2 /dev/sd[bc]              # create a raid
2. echo 1 > /sys/module/md_mod/parameters/error_inject # enable the sleep
3. mdadm --add /dev/md0 /dev/sda                       # add a disk to the raid
4. submit a BLKRRPART ioctl to the raid within 10s

The analysis makes sense. I also hit the issue a couple of times without
adding extra delays. But I am not sure whether this is the best fix (I
didn't find real issues with it either).

To be extra safe and future proof, we can do something like the following
to not suspend the array for ADD_NEW_DISK on not-running arrays.

This appears to solve the problem reported in

https://bugzilla.kernel.org/show_bug.cgi?id=218459

Thanks,
Song

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 9e41a9aaba8b..395911d5f4d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7570,10 +7570,11 @@ static inline bool md_ioctl_valid(unsigned int cmd)
 	}
 }

-static bool md_ioctl_need_suspend(unsigned int cmd)
+static bool md_ioctl_need_suspend(struct mddev *mddev, unsigned int cmd)
 {
 	switch (cmd) {
 	case ADD_NEW_DISK:
+		return mddev->pers != NULL;

Did you already check that this problem is not related to 'active_io'
being leaked for flush IO?

I don't understand the problem reported yet. If 'mddev->pers' is not set
yet, md_submit_bio() will return directly, and 'active_io' should not be
grabbed in the first place.

md_run() is the only place to convert 'mddev->pers' from NULL to a real
personality, and it's protected by 'reconfig_mutex', however,
md_ioctl_need_suspend() is called without 'reconfig_mutex', hence there
is a race condition:

md_ioctl_need_suspend          array_state_store
 // mddev->pers is NULL,
 // return false
                               mddev_lock
                                do_md_run
                                 mddev->pers = xxx
                               mddev_unlock

 // mddev_suspend is not called
 mddev_lock
  md_add_new_disk
   if (mddev->pers)
    md_import_device
    bind_rdev_to_array
     add_bound_rdev
      mddev->pers->hot_add_disk
       -> hot add disk without suspending
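The race above can be shown with a small user-space model: a suspend
decision snapshotted before taking 'reconfig_mutex' can be invalidated by a
concurrent do_md_run(); re-evaluating it under the lock gives a stable
answer. This is only a sketch of the locking pattern; all names are
illustrative, not the actual md code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the race: 'pers' flips from NULL to non-NULL under
 * 'reconfig_mutex' (as do_md_run() does), so a decision based on 'pers'
 * is only stable if it is made while holding that mutex. */
struct mddev_model {
	pthread_mutex_t reconfig_mutex;
	void *pers;     /* NULL until the array is running */
	bool suspended;
};

/* Racy snapshot, taken without the lock (mirrors md_ioctl_need_suspend) */
bool need_suspend_unlocked(struct mddev_model *m)
{
	return m->pers != NULL;
}

/* Safe variant: decide under reconfig_mutex, after any do_md_run() */
bool add_new_disk_decides_locked(struct mddev_model *m)
{
	pthread_mutex_lock(&m->reconfig_mutex);
	bool suspend = m->pers != NULL; /* stable while the lock is held */
	if (suspend)
		m->suspended = true;    /* stands in for mddev_suspend() */
	pthread_mutex_unlock(&m->reconfig_mutex);
	return suspend;
}
```

The unlocked snapshot taken while 'pers' is still NULL would skip the
suspend even though the array is running by the time the disk is added; the
locked check does not have that window.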

Thanks,
Kuai

 	case HOT_ADD_DISK:
 	case HOT_REMOVE_DISK:
 	case SET_BITMAP_FILE:
@@ -7625,6 +7626,7 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	void __user *argp = (void __user *)arg;
 	struct mddev *mddev = NULL;
 	bool did_set_md_closing = false;
+	bool need_suspend;

 	if (!md_ioctl_valid(cmd))
 		return -ENOTTY;
@@ -7716,8 +7718,11 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	if (!md_is_rdwr(mddev))
 		flush_work(&mddev->sync_work);

-	err = md_ioctl_need_suspend(cmd) ? mddev_suspend_and_lock(mddev) :
-					   mddev_lock(mddev);
+	need_suspend = md_ioctl_need_suspend(mddev, cmd);
+	if (need_suspend)
+		err = mddev_suspend_and_lock(mddev);
+	else
+		err = mddev_lock(mddev);
 	if (err) {
 		pr_debug("md: ioctl lock interrupted, reason %d, cmd %d\n",
 			 err, cmd);
@@ -7846,8 +7851,10 @@ static int md_ioctl(struct block_device *bdev, blk_mode_t mode,
 	    err != -EINVAL)
 		mddev->hold_active = 0;

-	md_ioctl_need_suspend(cmd) ? mddev_unlock_and_resume(mddev) :
-				     mddev_unlock(mddev);
+	if (need_suspend)
+		mddev_unlock_and_resume(mddev);
+	else
+		mddev_unlock(mddev);

 out:
 	if(did_set_md_closing)