Re: [RESEND PATCH] blk-mq: fix hang caused by freeze/unfreeze sequence

From: Bob Liu
Date: Wed Apr 17 2019 - 00:07:23 EST


On 4/15/19 5:46 PM, Roman Penyaev wrote:
> On 2019-04-13 05:42, Bart Van Assche wrote:
>> On 4/9/19 2:08 AM, Bob Liu wrote:
>>> Âvoid blk_freeze_queue_start(struct request_queue *q)
>>> Â{
>>> -ÂÂÂ int freeze_depth;
>>> -
>>> -ÂÂÂ freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>>> -ÂÂÂ if (freeze_depth == 1) {
>>> +ÂÂÂ mutex_lock(&q->mq_freeze_lock);
>>> +ÂÂÂ if (++q->mq_freeze_depth == 1) {
>>> ÂÂÂÂÂÂÂÂ percpu_ref_kill(&q->q_usage_counter);
>>> +ÂÂÂÂÂÂÂ mutex_unlock(&q->mq_freeze_lock);
>>> ÂÂÂÂÂÂÂÂ if (queue_is_mq(q))
>>> ÂÂÂÂÂÂÂÂÂÂÂÂ blk_mq_run_hw_queues(q, false);
>>> +ÂÂÂ } else {
>>> +ÂÂÂÂÂÂÂ mutex_unlock(&q->mq_freeze_lock);
>>> ÂÂÂÂ }
>>> Â}
>> Have you considered to move the mutex_unlock() call to the end of the function
>> such that there is only one mutex_unlock() call instead of two? In case you
>> would be worried about holding the mutex around the code that runs the queue,
>> how about changing the blk_mq_run_hw_queues() call such that the queues are
>> run async?
>
> Hi Bart,
>
> The only purpose of 'mq_freeze_lock' is to avoid race between mq_freeze_depth
> variable and the following usage of q_usage_counter percpu ref. I admit that
> my original comment is quite unclear, but locked section should be as short
> as possible, so returning to your question: better to have two unlock calls
> instead of expanding locked critical section.
>
> Unfortunately I do not have hardware to play again with the issue, but I see
> there is a nice candidate for a quick reproduction:Â null_blk queues with
> shared tags. Having several queues with shared tags and a script, which
> powers on/off (I mean 'power' entry of configfs of the null_blk) different
> null devices from different cpus it is quite possible to trigger the issue.
> Random short msdelay() in correct places can help to increase probability to
> hit the issue quite fast.
>
>
> But Bob, what is the backtrace of the issue you hit? What is the device?
> Conditions to reproduce the issue are quite specific and frankly I did not
> find any "naked" (without any locks) calls of blk_mq_freeze/unfreeze sequence,
> the only candidate which I found, seems, null_blk (not 100% sure, but worth to
> try).
>

Yes, it can be reproduced with null_blk.
But I added a msleep to save time.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9437a5e..875967f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -202,6 +202,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
+ msleep(1000);
percpu_ref_resurrect(&q->q_usage_counter);
wake_up_all(&q->mq_freeze_wq);
}



Below is the backtrace:
---
[ 234.604280] ------------[ cut here ]------------
[ 234.604288] percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
[ 234.604305] WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0
[ 234.604306] Modules linked in: null_blk thunderbolt ccm xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bnep arc4 intel_rapl iwlmvm x86_pkg_temp_thermal intel_powerclamp mac80211 coretemp snd_soc_skl snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp kvm_intel snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core nls_iso8859_1 snd_hda_codec_hdmi snd_hda_codec_conexant snd_compress ac97_bus snd_hda_codec_generic snd_pcm_dmaengine crct10dif_pclmul snd_hda_intel crc32_pclmul snd_hda_codec ghash_clmulni_intel iwlwifi snd_hda_core aesni_intel snd_hwdep snd_pcm thinkpad_acpi uvcvideo nvram ledtrig_audio aes_x86_64 crypto_simd btusb cryptd btrtl glue_helper snd_seq_midi btbcm snd_seq_midi_event intel_cstate videobuf2_vmalloc btintel snd_rawmidi cfg80211 bluetooth intel_rapl_perf
[ 234.604361] hid_sensor_accel_3d snd_seq hid_sensor_magn_3d videobuf2_memops videobuf2_v4l2 videobuf2_common hid_sensor_rotation rtsx_pci_ms hid_sensor_als wmi_bmof videodev input_leds joydev serio_raw hid_sensor_gyro_3d intel_wmi_thunderbolt hid_sensor_trigger snd_seq_device industrialio_triggered_buffer kfifo_buf snd_timer hid_sensor_iio_common media industrialio mei_me snd memstick ecdh_generic mei ucsi_acpi processor_thermal_device typec_ucsi intel_soc_dts_iosf intel_pch_thermal typec soundcore int3403_thermal int340x_thermal_zone int3400_thermal acpi_thermal_rel mac_hid acpi_pad sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 multipath linear hid_sensor_custom hid_sensor_hub intel_ishtp_hid hid_logitech_hidpp i915 kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm irqbypass i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm wacom
[ 234.604417] hid_logitech_dj rtsx_pci_sdmmc hid_generic e1000e usbhid psmouse hid intel_ish_ipc rtsx_pci intel_ishtp wmi video
[ 234.604430] CPU: 0 PID: 11854 Comm: sh Not tainted 5.0.0+ #33
[ 234.604431] Hardware name: LENOVO 20LJS2EV08/20LJS2EV08, BIOS R0SET33W (1.17 ) 07/18/2018
[ 234.604436] RIP: 0010:percpu_ref_kill_and_confirm+0x99/0xb0
[ 234.604439] Code: 00 eb d3 80 3d 76 a4 2a 01 00 75 ab 48 8b 53 10 48 c7 c6 00 fd a6 8f 48 c7 c7 b0 56 d5 8f c6 05 5b a4 2a 01 01 e8 f7 a2 b1 ff <0f> 0b 48 8b 43 08 eb 85 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 234.604441] RSP: 0018:ffffa2880ae9fd00 EFLAGS: 00010086
[ 234.604443] RAX: 0000000000000000 RBX: ffff9768b3b815d8 RCX: 0000000000000000
[ 234.604445] RDX: 0000000000000055 RSI: ffffffff905758f5 RDI: 0000000000000046
[ 234.604447] RBP: ffffa2880ae9fd18 R08: fffffffa314c837d R09: ffffffffffffffff
[ 234.604449] R10: ffffe3c84fc0a200 R11: 000000000001abb4 R12: 0000000000000246
[ 234.604451] R13: 0000000000000000 R14: ffff9768b0448300 R15: ffff9768b2703600
[ 234.604454] FS: 00007fae2f3e6540(0000) GS:ffff9768d1400000(0000) knlGS:0000000000000000
[ 234.604456] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 234.604458] CR2: 000056078102ee08 CR3: 0000000431384002 CR4: 00000000003606f0
[ 234.604460] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 234.604462] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 234.604463] Call Trace:
[ 234.604471] blk_freeze_queue_start+0x2d/0x50
[ 234.604475] blk_set_queue_dying+0x17/0x40
[ 234.604479] blk_cleanup_queue+0x26/0xd0
[ 234.604486] null_del_dev+0x59/0x110 [null_blk]
[ 234.604491] nullb_device_power_store+0xca/0x100 [null_blk]
[ 234.604495] configfs_write_file+0xb9/0x120
[ 234.604500] __vfs_write+0x3a/0x1b0
[ 234.604505] ? apparmor_file_permission+0x1a/0x20
[ 234.604509] ? security_file_permission+0x3b/0xf0
[ 234.604513] ? _cond_resched+0x1a/0x50
[ 234.604516] vfs_write+0xb8/0x1b0
[ 234.604519] ksys_write+0x55/0xc0
[ 234.604523] __x64_sys_write+0x1a/0x20
[ 234.604528] do_syscall_64+0x5a/0x110
[ 234.604532] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 234.604535] RIP: 0033:0x7fae2ef04154
[ 234.604537] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[ 234.604538] RSP: 002b:00007ffddfe68e88 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 234.604540] RAX: ffffffffffffffda RBX: 00005649349e68a0 RCX: 00007fae2ef04154
[ 234.604542] RDX: 0000000000000002 RSI: 00005649349e68a0 RDI: 0000000000000001
[ 234.604543] RBP: 0000000000000002 R08: 0000000000000077 R09: 0000000000000000
[ 234.604544] R10: 00005649349e6010 R11: 0000000000000246 R12: 0000000000000001
[ 234.604545] R13: 0000000000000002 R14: 7fffffffffffffff R15: 00007ffddfe6aed9
[ 234.604548] ---[ end trace d99129291464ebae ]---