Re: panic with CPU hotplug + blk-mq + scsi-mq

From: Ming Lei
Date: Sat Apr 18 2015 - 00:25:19 EST


Hi Dongsu,

On Fri, Apr 17, 2015 at 5:41 AM, Dongsu Park
<dongsu.park@xxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> there's a critical bug involving CPU hotplug, blk-mq, and scsi-mq.
> Every time a CPU is offlined, some arbitrary range of kernel memory
> seems to get corrupted. Then after a while, the kernel panics at random
> places when block IOs are issued. (for example, see the call traces below)

Thanks for the report.

>
> This bug can be easily reproducible with a Qemu VM running with virtio-scsi,
> when its guest kernel is 3.19-rc1 or higher, and when scsi-mq is loaded
> with blk-mq enabled. And yes, 4.0 release is still affected, as well as
> Jens' for-4.1/core. How to reproduce:
>
> # echo 0 > /sys/devices/system/cpu/cpu1/online
> (and issue some block IOs, that's it.)
>
> Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been hidden
> until commit ccbedf117f01 ("virtio_scsi: support multi hw queue of blk-mq"),
> which started to allow virtio-scsi to map virtqueues to blk-mq hardware
> queues. Reverting that commit makes the bug go away. However, I suppose
> reverting it would not be the correct solution.

I agree, and that patch only enables multiple hw queues.

>
> More precisely, every time a CPU hotplug event gets triggered,
> a call graph is like the following:
>
> blk_mq_queue_reinit_notify()
> -> blk_mq_queue_reinit()
> -> blk_mq_map_swqueue()
> -> blk_mq_free_rq_map()
> -> scsi_exit_request()
>
> From that point, as soon as any address in the request gets modified, an
> arbitrary range of memory gets corrupted. My first guess was that the exit
> routine might try to deallocate tags->rqs[] entries holding invalid
> addresses. But actually it looks like that's not the case,
> and cmd->sense_buffer looks valid as well.
> It's not obvious to me exactly what could go wrong.
>
> Does anyone have an idea?

As far as I can see, there are at least two problems:
- a race between the timeout handler and CPU hotplug
- with shared tags, incorrect setting and checking of hctx->tags
  during CPU online handling

So could you please test the two attached patches and see if they fix your issue?

I ran them in my VM, and the oops does seem to disappear.


Thanks,
Ming Lei
>
> Regards,
> Dongsu
>
> ---- [beginning of call traces] ----
> [ 47.274292] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [ 47.275013] IP: [<ffffffff8140b31d>] __bt_get.isra.5+0x7d/0x1e0
> [ 47.275013] PGD 79c55067 PUD 7ba17067 PMD 0
> [ 47.275013] Oops: 0000 [#1] SMP
> [ 47.275013] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq serio_raw i2c_piix4 virtio_net
> [ 47.275013] CPU: 3 PID: 6232 Comm: blkid Not tainted 4.0.0 #303
> [ 47.275013] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
> [ 47.275013] task: ffff88003dfbc020 ti: ffff880079bac000 task.ti: ffff880079bac000
> [ 47.275013] RIP: 0010:[<ffffffff8140b31d>] [<ffffffff8140b31d>] __bt_get.isra.5+0x7d/0x1e0
> [ 47.275013] RSP: 0018:ffff880079baf898 EFLAGS: 00010246
> [ 47.275013] RAX: 000000000000003c RBX: ffff880079198400 RCX: 0000000000000078
> [ 47.275013] RDX: ffff88007fddbb80 RSI: 0000000000000010 RDI: ffff880079198400
> [ 47.275013] RBP: ffff880079baf8e8 R08: ffff88007fddbb80 R09: 0000000000000000
> [ 47.275013] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
> [ 47.275013] R13: 0000000000000010 R14: ffff880079baf9e8 R15: ffff88007fddbb80
> [ 47.275013] FS: 00002b270c049800(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [ 47.275013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 47.275013] CR2: 0000000000000018 CR3: 000000007ca8d000 CR4: 00000000001407e0
> [ 47.275013] Stack:
> [ 47.275013] ffff880079baf978 ffff88007fdd58c0 0000000000000078 ffffffff814071ff
> [ 47.275013] ffff880079baf8d8 ffff880079198400 0000000000000010 0000000000000010
> [ 47.275013] ffff880079baf9e8 ffff88007fddbb80 ffff880079baf968 ffffffff8140b4e5
> [ 47.275013] Call Trace:
> [ 47.275013] [<ffffffff814071ff>] ? blk_mq_queue_enter+0x9f/0x2d0
> [ 47.275013] [<ffffffff8140b4e5>] bt_get+0x65/0x1e0
> [ 47.275013] [<ffffffff814071ff>] ? blk_mq_queue_enter+0x9f/0x2d0
> [ 47.275013] [<ffffffff810c9b40>] ? wait_woken+0xa0/0xa0
> [ 47.275013] [<ffffffff8140ba07>] blk_mq_get_tag+0xa7/0xd0
> [ 47.275013] [<ffffffff8140630b>] __blk_mq_alloc_request+0x1b/0x200
> [ 47.275013] [<ffffffff81408736>] blk_mq_map_request+0xd6/0x4e0
> [ 47.275013] [<ffffffff8140a53e>] blk_mq_make_request+0x6e/0x2d0
> [ 47.275013] [<ffffffff813fb844>] ? generic_make_request_checks+0x674/0x6a0
> [ 47.275013] [<ffffffff813f23ae>] ? bio_add_page+0x5e/0x70
> [ 47.275013] [<ffffffff813fb930>] generic_make_request+0xc0/0x110
> [ 47.275013] [<ffffffff813fb9e8>] submit_bio+0x68/0x150
> [ 47.275013] [<ffffffff811b0c6c>] ? lru_cache_add+0x1c/0x50
> [ 47.275013] [<ffffffff8125972a>] mpage_bio_submit+0x2a/0x40
> [ 47.275013] [<ffffffff8125a81c>] mpage_readpages+0x10c/0x130
> [ 47.275013] [<ffffffff81254040>] ? I_BDEV+0x10/0x10
> [ 47.275013] [<ffffffff81254040>] ? I_BDEV+0x10/0x10
> [ 47.275013] [<ffffffff8119e417>] ? __page_cache_alloc+0x137/0x160
> [ 47.275013] [<ffffffff8125486d>] blkdev_readpages+0x1d/0x20
> [ 47.275013] [<ffffffff811ae43f>] __do_page_cache_readahead+0x29f/0x320
> [ 47.275013] [<ffffffff811ae305>] ? __do_page_cache_readahead+0x165/0x320
> [ 47.275013] [<ffffffff811aea14>] force_page_cache_readahead+0x34/0x60
> [ 47.275013] [<ffffffff811aea86>] page_cache_sync_readahead+0x46/0x50
> [ 47.275013] [<ffffffff811a094c>] generic_file_read_iter+0x52c/0x640
> [ 47.275013] [<ffffffff81254c17>] blkdev_read_iter+0x37/0x40
> [ 47.275013] [<ffffffff81211a0e>] new_sync_read+0x7e/0xb0
> [ 47.275013] [<ffffffff81212c88>] __vfs_read+0x18/0x50
> [ 47.275013] [<ffffffff81212d4d>] vfs_read+0x8d/0x150
> [ 47.275013] [<ffffffff81212e59>] SyS_read+0x49/0xb0
> [ 47.275013] [<ffffffff817063b2>] system_call_fastpath+0x12/0x17
> [ 47.275013] Code: 97 18 03 00 00 bf 04 00 00 00 41 f7 f1 83 f8 04 0f 43 f8 b8 ff ff ff ff 44 39 d7 0f 86 c1 00 00 00 41 8b 00 48 89 4d c0 49 89 f5 <8b> 4e 08 8b 56 0c 4c 89 45 b0 c7 45 c8 00 00 00 00 41 89 c4 89
> [ 47.275013] RIP [<ffffffff8140b31d>] __bt_get.isra.5+0x7d/0x1e0
> [ 47.275013] RSP <ffff880079baf898>
> [ 47.275013] CR2: 0000000000000018
> [ 47.275013] ---[ end trace 9a650b674f0fae74 ]---
> [ 47.701261] note: kworker/3:2[225] exited with preempt_count 1
> [ 47.815398] BUG: unable to handle kernel paging request at ffffffffffffff98
> [ 47.816324] IP: [<ffffffff810a00d0>] kthread_data+0x10/0x20
> [ 47.816324] PGD 1c0c067 PUD 1c0e067 PMD 0
> [ 47.816324] Oops: 0000 [#3] SMP
> [ 47.816324] Modules linked in: fuse cpufreq_stats binfmt_misc 9p fscache dm_round_robin loop dm_multipath 9pnet_virtio rtc_cmos 9pnet acpi_cpufreq serio_raw i2c_piix4 virtio_net
> [ 47.816324] CPU: 3 PID: 225 Comm: kworker/3:2 Tainted: G D W 4.0.0 #303
> [ 47.816324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
> [ 47.816324] task: ffff88007ac90000 ti: ffff88007906c000 task.ti: ffff88007906c000
> [ 47.816324] RIP: 0010:[<ffffffff810a00d0>] [<ffffffff810a00d0>] kthread_data+0x10/0x20
> [ 47.816324] RSP: 0018:ffff88007906f5e8 EFLAGS: 00010092
> [ 47.816324] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 000000000000000f
> [ 47.816324] RDX: 000000000000000f RSI: 0000000000000003 RDI: ffff88007ac90000
> [ 47.816324] RBP: ffff88007906f5e8 R08: ffff88007ac90090 R09: 0000000000000000
> [ 47.816324] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88007fdd4dc0
> [ 47.816324] R13: ffff88007ac90000 R14: 0000000000000003 R15: 0000000000000000
> [ 47.816324] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [ 47.816324] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 47.816324] CR2: 0000000000000028 CR3: 0000000001c0b000 CR4: 00000000001407e0
> [ 47.816324] Stack:
> [ 47.816324] ffff88007906f608 ffffffff81099f35 ffff88007906f608 00000000001d4dc0
> [ 47.816324] ffff88007906f678 ffffffff816ff757 ffffffff8107cfc6 0000000000000292
> [ 47.816324] ffffffff8107cffa ffff88007ac90000 ffff88007906f678 0000000000000296
> [ 47.816324] Call Trace:
> [ 47.816324] [<ffffffff81099f35>] wq_worker_sleeping+0x15/0xa0
> [ 47.816324] [<ffffffff816ff757>] __schedule+0xa77/0x1080
> [ 47.816324] [<ffffffff8107cfc6>] ? do_exit+0x756/0xbf0
> [ 47.816324] [<ffffffff8107cffa>] ? do_exit+0x78a/0xbf0
> [ 47.816324] [<ffffffff816ffd97>] schedule+0x37/0x90
> [ 47.816324] [<ffffffff8107d0d6>] do_exit+0x866/0xbf0
> [ 47.816324] [<ffffffff810ec14e>] ? kmsg_dump+0xfe/0x200
> [ 47.816324] [<ffffffff810068ad>] oops_end+0x8d/0xd0
> [ 47.816324] [<ffffffff81047849>] no_context+0x119/0x370
> [ 47.816324] [<ffffffff810ce795>] ? cpuacct_charge+0x5/0x1c0
> [ 47.816324] [<ffffffff810b4a25>] ? sched_clock_local+0x25/0x90
> [ 47.816324] [<ffffffff81047b25>] __bad_area_nosemaphore+0x85/0x210
> [ 47.816324] [<ffffffff81047cc3>] bad_area_nosemaphore+0x13/0x20
> [ 47.816324] [<ffffffff81047fb6>] __do_page_fault+0xb6/0x490
> [ 47.816324] [<ffffffff8104839c>] do_page_fault+0xc/0x10
> [ 47.816324] [<ffffffff817080c2>] page_fault+0x22/0x30
> [ 47.816324] [<ffffffff8140b31d>] ? __bt_get.isra.5+0x7d/0x1e0
> [ 47.816324] [<ffffffff8140b4e5>] bt_get+0x65/0x1e0
> [ 47.816324] [<ffffffff810c9b40>] ? wait_woken+0xa0/0xa0
> [ 47.816324] [<ffffffff8140ba07>] blk_mq_get_tag+0xa7/0xd0
> [ 47.816324] [<ffffffff8140630b>] __blk_mq_alloc_request+0x1b/0x200
> [ 47.816324] [<ffffffff81407f91>] blk_mq_alloc_request+0xa1/0x250
> [ 47.816324] [<ffffffff813fc74c>] blk_get_request+0x2c/0xf0
> [ 47.816324] [<ffffffff810a6acd>] ? __might_sleep+0x4d/0x90
> [ 47.816324] [<ffffffff815747dd>] scsi_execute+0x3d/0x1f0
> [ 47.816324] [<ffffffff815763be>] scsi_execute_req_flags+0x8e/0x100
> [ 47.816324] [<ffffffff81576a43>] scsi_test_unit_ready+0x83/0x130
> [ 47.816324] [<ffffffff8158672e>] sd_check_events+0x14e/0x1b0
> [ 47.816324] [<ffffffff8140e731>] disk_check_events+0x51/0x170
> [ 47.816324] [<ffffffff8140e86c>] disk_events_workfn+0x1c/0x20
> [ 47.816324] [<ffffffff81099128>] process_one_work+0x1e8/0x800
> [ 47.816324] [<ffffffff8109909d>] ? process_one_work+0x15d/0x800
> [ 47.816324] [<ffffffff8109981a>] ? worker_thread+0xda/0x470
> [ 47.816324] [<ffffffff81099793>] worker_thread+0x53/0x470
> [ 47.816324] [<ffffffff81099740>] ? process_one_work+0x800/0x800
> [ 47.816324] [<ffffffff81099740>] ? process_one_work+0x800/0x800
> [ 47.816324] [<ffffffff8109f652>] kthread+0xf2/0x110
> [ 47.816324] [<ffffffff810d3d4d>] ? trace_hardirqs_on+0xd/0x10
> [ 47.816324] [<ffffffff8109f560>] ? kthread_create_on_node+0x230/0x230
> [ 47.816324] [<ffffffff81706308>] ret_from_fork+0x58/0x90
> [ 47.816324] [<ffffffff8109f560>] ? kthread_create_on_node+0x230/0x230
> [ 47.816324] Code: 00 48 89 e5 5d 48 8b 40 88 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 87 20 04 00 00 55 48 89 e5 <48> 8b 40 98 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> [ 47.816324] RIP [<ffffffff810a00d0>] kthread_data+0x10/0x20
> [ 47.816324] RSP <ffff88007906f5e8>
> [ 47.816324] CR2: ffffffffffffff98
> [ 47.816324] ---[ end trace 9a650b674f0fae76 ]---
> [ 47.816324] Fixing recursive fault but reboot is needed!
> ---- [end of call traces] ----
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
From 9aed1bd79531d91513cd16ed90872e4349425acc Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@xxxxxxxxxxxxx>
Date: Fri, 17 Apr 2015 23:50:48 -0400
Subject: [PATCH 1/2] block: blk-mq: fix race between timeout and CPU hotplug

Firstly, during CPU hotplug, even when the queue is frozen, the
timeout handler may still fire and access hctx->tags, which can
cause a use-after-free, so this patch deactivates the timeout
handler inside the CPU hotplug notifier.

Secondly, tags can be shared by more than one queue, so we have
to check whether the hctx has been disabled; otherwise a
use-after-free on the tags can still be triggered.

Cc: <stable@xxxxxxxxxxxxxxx>
Reported-by: Dongsu Park <dongsu.park@xxxxxxxxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxxxxx>
---
block/blk-mq.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 67f01a0..58a3b4c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -677,8 +677,11 @@ static void blk_mq_rq_timer(unsigned long priv)
data.next = blk_rq_timeout(round_jiffies_up(data.next));
mod_timer(&q->timeout, data.next);
} else {
- queue_for_each_hw_ctx(q, hctx, i)
- blk_mq_tag_idle(hctx);
+ queue_for_each_hw_ctx(q, hctx, i) {
+ /* the hctx may be disabled, so we have to check here */
+ if (hctx->tags)
+ blk_mq_tag_idle(hctx);
+ }
}
}

@@ -2085,9 +2088,13 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
*/
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_freeze_queue_start(q);
- list_for_each_entry(q, &all_q_list, all_q_node)
+ list_for_each_entry(q, &all_q_list, all_q_node) {
blk_mq_freeze_queue_wait(q);

+ /* deactivate timeout handler */
+ del_timer_sync(&q->timeout);
+ }
+
list_for_each_entry(q, &all_q_list, all_q_node)
blk_mq_queue_reinit(q);

--
1.9.1

From 8b70c8612543859173230fbd16a63bacf84ba23a Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@xxxxxxxxxxxxx>
Date: Sat, 18 Apr 2015 00:01:31 -0400
Subject: [PATCH 2/2] blk-mq: fix CPU hotplug handling

Firstly, hctx->tags has to be set to NULL in blk_mq_map_swqueue()
whenever the hctx is being disabled, no matter whether set->tags[i]
is NULL or not, because the shared tags may already have been freed
from another request_queue.

The same situation has to be considered in blk_mq_hctx_cpu_online()
too.

Cc: <stable@xxxxxxxxxxxxxxx>
Reported-by: Dongsu Park <dongsu.park@xxxxxxxxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxxxxx>
---
block/blk-mq.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 58a3b4c..612d5c6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1580,15 +1580,20 @@ static int blk_mq_hctx_cpu_online(struct blk_mq_hw_ctx *hctx, int cpu)
{
struct request_queue *q = hctx->queue;
struct blk_mq_tag_set *set = q->tag_set;
+ struct blk_mq_tags *tags = set->tags[hctx->queue_num];

- if (set->tags[hctx->queue_num])
+ /* tags can be shared by more than one queues */
+ if (hctx->tags)
return NOTIFY_OK;

- set->tags[hctx->queue_num] = blk_mq_init_rq_map(set, hctx->queue_num);
- if (!set->tags[hctx->queue_num])
- return NOTIFY_STOP;
+ if (!tags) {
+ tags = blk_mq_init_rq_map(set, hctx->queue_num);
+ if (!tags)
+ return NOTIFY_STOP;
+ set->tags[hctx->queue_num] = tags;
+ }

- hctx->tags = set->tags[hctx->queue_num];
+ hctx->tags = tags;
return NOTIFY_OK;
}

@@ -1813,8 +1818,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
if (set->tags[i]) {
blk_mq_free_rq_map(set, set->tags[i], i);
set->tags[i] = NULL;
- hctx->tags = NULL;
}
+ hctx->tags = NULL;
continue;
}

--
1.9.1