[PATCH] block: don't read block device if it's invalid

From: Xianting Tian
Date: Tue Aug 11 2020 - 09:43:49 EST


We found several processes stuck in 'D' state after an nvme device was
hot-removed. The call traces are shown below. Process 848 took the lock
'bdev->bd_mutex' in blkdev_reread_part() and was then scheduled out
waiting for its IO to complete. That IO can never complete because the
device is gone, so 'bdev->bd_mutex' is never unlocked, and every other
process that needs the same lock blocks behind it.
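
For reference, a minimal sketch of the path PID 848 is stuck on,
condensed from its trace further below (not the exact source of that
kernel; function bodies are collapsed into comments):

static int blkdev_reread_part(struct block_device *bdev)
{
	int res;

	mutex_lock(&bdev->bd_mutex);		/* <== held across the IO below */
	res = __blkdev_reread_part(bdev);	/* rescan_partitions() ->
						 * efi_partition() -> read_lba() ->
						 * read_dev_sector(): blocks in
						 * io_schedule() forever once the
						 * device is gone */
	mutex_unlock(&bdev->bd_mutex);		/* never reached */

	return res;
}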

When an nvme device is hot-removed, the kernel starts a worker to tear
the device down, as the call trace of process 1111504 below shows. The
relevant part of nvme_kill_queues() is sketched here: 'NVME_NS_DEAD' is
set first, then revalidate_disk() invokes nvme_revalidate_disk(), which
sees that 'NVME_NS_DEAD' is set and calls 'set_capacity(disk, 0)' to
set the disk capacity to 0.
nvme_kill_queues()
    if (test_and_set_bit(NVME_NS_DEAD, &ns->flags)) return;
    revalidate_disk(disk)
        disk->fops->revalidate_disk(disk) <== for an nvme device this is nvme_revalidate_disk()
        mutex_lock(&bdev->bd_mutex)
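
For clarity, a condensed sketch of the check in nvme_revalidate_disk()
(simplified from drivers/nvme/host/core.c; the real function goes on to
re-read the identify data):

static int nvme_revalidate_disk(struct gendisk *disk)
{
	struct nvme_ns *ns = disk->private_data;

	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
		set_capacity(disk, 0);	/* disk now reports 0 sectors */
		return -ENODEV;
	}
	/* ... normal path: re-read identify data and update the size ... */
	return 0;
}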

This patch reduces the probability of hitting this problem. Before
taking 'bdev->bd_mutex' in blkdev_reread_part(), check whether the disk
capacity is 0 and, if so, return immediately. That avoids the scenario
above: once the nvme device has been hot-removed and its capacity
already set to 0, a process like 848 that tries to reread the partition
table returns directly from blkdev_reread_part() and never takes
'bdev->bd_mutex', the lock it could otherwise never release because its
IO can never complete.
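
The check works because set_capacity() and get_capacity() operate on
the same field; roughly (from include/linux/genhd.h of that era):

static inline sector_t get_capacity(struct gendisk *disk)
{
	return disk->part0.nr_sects;
}

static inline void set_capacity(struct gendisk *disk, sector_t size)
{
	disk->part0.nr_sects = size;
}

So once nvme_kill_queues() has run, get_capacity() returns 0 and the
new test below returns -EIO before 'bdev->bd_mutex' is ever taken.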

cat /proc/848/stack
[<ffffffffb14b9d56>] io_schedule+0x16/0x40
[<ffffffffb15b009e>] do_read_cache_page+0x3ee/0x5e0
[<ffffffffb15b02a5>] read_cache_page+0x15/0x20
[<ffffffffb17810ad>] read_dev_sector+0x2d/0xa0
[<ffffffffb17856e4>] read_lba+0x104/0x1c0
[<ffffffffb1785d1a>] find_valid_gpt+0xfa/0x720
[<ffffffffb17863c9>] efi_partition+0x89/0x430
[<ffffffffb1783280>] check_partition+0x100/0x1f0
[<ffffffffb17819e4>] rescan_partitions+0xb4/0x360
[<ffffffffb177d544>] __blkdev_reread_part+0x64/0x70
[<ffffffffb177d573>] blkdev_reread_part+0x23/0x40 <<==mutex_lock(&bdev->bd_mutex);
[<ffffffffb177e00b>] blkdev_ioctl+0x44b/0x8e0
[<ffffffffb1692dc1>] block_ioctl+0x41/0x50
[<ffffffffb1668d37>] do_vfs_ioctl+0xa7/0x5e0
[<ffffffffb16692e9>] SyS_ioctl+0x79/0x90
[<ffffffffb1bd6001>] entry_SYSCALL_64_fastpath+0x1f/0xb9
[<ffffffffffffffff>] 0xffffffffffffffff

cat /proc/1111504/stack
[<ffffffffb1692fb9>] revalidate_disk+0x49/0x80 <<==mutex_lock(&bdev->bd_mutex);
[<ffffffffc0112ff2>] nvme_kill_queues+0x52/0x80 [nvme_core]
[<ffffffffc0116304>] nvme_remove_namespaces+0x44/0x50 [nvme_core]
[<ffffffffc0050805>] nvme_remove+0x85/0x130 [nvme]
[<ffffffffb17e9a69>] pci_device_remove+0x39/0xc0
[<ffffffffb18f9f91>] device_release_driver_internal+0x141/0x210
[<ffffffffb18fa072>] device_release_driver+0x12/0x20
[<ffffffffb17e1b3c>] pci_stop_bus_device+0x8c/0xa0
[<ffffffffb17e1cd2>] pci_stop_and_remove_bus_device+0x12/0x20
[<ffffffffb17fdeaa>] pciehp_unconfigure_device+0x7a/0x1e0
[<ffffffffb17fd9b2>] pciehp_disable_slot+0x52/0xd0
[<ffffffffb17fdaba>] pciehp_power_thread+0x8a/0xb0
[<ffffffffb14a556e>] process_one_work+0x14e/0x370
[<ffffffffb14a601d>] worker_thread+0x4d/0x3f0
[<ffffffffb14abd09>] kthread+0x109/0x140
[<ffffffffb1bd627a>] ret_from_fork+0x2a/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

cat /proc/1197767/stack
[<ffffffffb169444e>] __blkdev_get+0x6e/0x450 <<==mutex_lock_nested(&bdev->bd_mutex, for_part);
[<ffffffffb16949d4>] blkdev_get+0x1a4/0x300
[<ffffffffb1694f1a>] blkdev_open+0x7a/0xa0
[<ffffffffb164f93f>] do_dentry_open+0x20f/0x330
[<ffffffffb1650e10>] vfs_open+0x50/0x70
[<ffffffffb16621f8>] path_openat+0x548/0x13b0
[<ffffffffb1664d71>] do_filp_open+0x91/0x100
[<ffffffffb16511e4>] do_sys_open+0x124/0x210
[<ffffffffb16512ee>] SyS_open+0x1e/0x20
[<ffffffffb140396c>] do_syscall_64+0x6c/0x1b0
[<ffffffffb1bd60c0>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

ps -eo pid,comm,state | grep ' D'
848 systemd-udevd D
1111504 kworker/10:1 D
1197767 isdct D
1198830 isdct D
1580322 xxd D
1616804 kworker/10:0 D
1626264 isdct D
1734726 kworker/10:2 D
2197993 isdct D
2662117 xxd D
3083718 xxd D
3189834 xxd D

Signed-off-by: Xianting Tian <xianting_tian@xxxxxxx>
---
block/ioctl.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/block/ioctl.c b/block/ioctl.c
index bdb3bbb..159bceb 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -94,6 +94,9 @@ static int blkdev_reread_part(struct block_device *bdev)
 {
 	int ret;
 
+	if (unlikely(!get_capacity(bdev->bd_disk)))
+		return -EIO;
+
 	if (!disk_part_scan_enabled(bdev->bd_disk) || bdev != bdev->bd_contains)
 		return -EINVAL;
 	if (!capable(CAP_SYS_ADMIN))
--
1.8.3.1