Re: kernel BUG at drivers/scsi/scsi_lib.c:1096!

From: Takashi Iwai
Date: Fri Dec 04 2015 - 11:59:28 EST


On Wed, 25 Nov 2015 20:01:47 +0100,
Hannes Reinecke wrote:
>
> On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> > On Wed, Nov 25 2015 at 4:04am -0500,
> > Hannes Reinecke <hare@xxxxxxx> wrote:
> >
> >> On 11/20/2015 04:28 PM, Ewan Milne wrote:
> >>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >>>> Can't we have a joint effort here?
> >>>> I've been spending a _LOT_ of time trying to debug things here, but
> >>>> none of the ideas I've come up with have been able to fix anything.
> >>>
> >>> Yes. I'm not the one primarily looking at it, and we don't have a
> >>> reproducer in-house. We just have the one dump right now.
> >>>
> >>>>
> >>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >>>> by one and be done with ...
> >>>>
> >>>
> >>> That might not fix it if it is a problem with the merge code, though.
> >>>
> >> And indeed, it doesn't.
> >
> > How did you arrive at that? Do you have a reproducer now?
> >
> Not a reproducer, but several dumps for analysis.
>
> >> Seems I finally found the culprit.
> >>
> >> What happens is this:
> >> We have two paths, with these seg_boundary_masks:
> >>
> >> path-1: seg_boundary_mask = 65535,
> >> path-2: seg_boundary_mask = 4294967295,
> >>
> >> consequently the DM request queue has this:
> >>
> >> md-1: seg_boundary_mask = 65535,
> >>
> >> What happens now is that a request is being formatted, and sent
> >> to path 2. During submission req->nr_phys_segments is formatted
> >> with the limits of path 2, arriving at a count of 3.
> >> Now the request gets retried on path 1, but as the NOMERGE request
> >> flag is set req->nr_phys_segments is never updated.
> >> But blk_rq_map_sg() ignores all counters, and just uses the
> >> bi_vec directly, resulting in a count of 4 -> boom.
> >>
> >> So the culprit here is the NOMERGE flag,
> >
> > NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
> >
> Yes.
>
> >> which is evaluated via
> >> ->dm_dispatch_request()
> >> ->blk_insert_cloned_request()
> >> ->blk_rq_check_limits()
> >
> > blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> > anyway after reading your mail I'm still left wondering if your proposed
> > patch is correct.
> >
> >> If the above assessment is correct, the following patch should
> >> fix it:
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 801ced7..12cccd6 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> >> */
> >> int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> >> {
> >> - if (!rq_mergeable(rq))
> >> + if (rq->cmd_type != REQ_TYPE_FS)
> >> return 0;
> >>
> >> if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> >> rq->cmd_flags)) {
> >>
> >>
> >> Mike? Jens?
> >> Can you comment on it?
> >
> > You're not explaining the actual change in the patch very well; I think
> > you're correct but you're leaving the justification as an exercise to
> > the reviewer:
> >
> > blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> > !rq_mergeable() check but you're saying for this case in question we
> > never get there -- due to the cloned request having NOMERGE set.
> >
> > So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> > open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
> >
> > I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> > the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> > sense for cloned requests that always have NOMERGE set.
> >
> > So you're saying that by having blk_rq_check_limits() go on to call
> > blk_recalc_rq_segments() this bug will be fixed?
> >
> That is the idea.
>
> I've already established that in all instances I have seen so far
> req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.
>
> As it turns out, req->nr_phys_segemnts _would_ have been updated in
> blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
> for the cloned request.
> So each cloned request inherits the values from the original request,
> despite the fact that req->nr_phys_segments _has_ to be evaluated in
> the final request_queue context, as the queue limits _might_ be
> different from the original (merged) queue limits of the multipath
> request queue.
>
> > BTW, I think blk_rq_check_limits()'s export should be removed and the
> > function made static and renamed to blk_clone_rq_check_limits(), again:
> > blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
> >
> Actually, seeing Jens' last comment the check for REQ_TYPE_FS is
> pointless, too, so we might as well remove the entire if-clause.
>
> > Seems prudent to make that change now to be clear that this code is only
> > used by cloned requests.
> >
> Yeah, that would make sense. I'll be preparing a patch.
> With a more detailed description :-)

Do we have already a fix? Right now I got (likely) this kernel BUG()
on the almost latest Linus tree (commit 25364a9e54fb8296). It
happened while I started a KVM right after a fresh boot. The machine
paniced even before that, so I hit this twice today.

I have a crash dump, so if anything is needed, let me know.
(But the system had to be restarted since it's my main workstation.)


thanks,

Takashi

===
[ 137.211255] ------------[ cut here ]------------
[ 137.211265] WARNING: CPU: 0 PID: 391 at block/blk-merge.c:435 blk_rq_map_sg+0x430/0x4d0()
[ 137.211267] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[ 137.211326] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[ 137.211342] CPU: 0 PID: 391 Comm: kworker/0:1H Not tainted 4.4.0-rc3-test2+ #183
[ 137.211344] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[ 137.211349] Workqueue: kblockd bfq_kick_queue
[ 137.211352] ffffffff81a6b805 ffff880214307ab8 ffffffff813b1cc9 0000000000000000
[ 137.211357] ffff880214307af0 ffffffff8106d9a2 ffff88019e7d49c8 0000000000000000
[ 137.211362] 0000000000000000 0000000000001000 0000000000000020 ffff880214307b00
[ 137.211367] Call Trace:
[ 137.211373] [<ffffffff813b1cc9>] dump_stack+0x4b/0x72
[ 137.211378] [<ffffffff8106d9a2>] warn_slowpath_common+0x82/0xc0
[ 137.211383] [<ffffffff8106da9a>] warn_slowpath_null+0x1a/0x20
[ 137.211387] [<ffffffff81381340>] blk_rq_map_sg+0x430/0x4d0
[ 137.211393] [<ffffffff81514a60>] ? scsi_init_cmd_errh+0x90/0x90
[ 137.211398] [<ffffffff81514fc4>] scsi_init_sgtable+0x44/0x80
[ 137.211402] [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[ 137.211408] [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[ 137.211413] [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[ 137.211416] [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[ 137.211420] [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[ 137.211424] [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[ 137.211428] [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[ 137.211433] [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[ 137.211438] [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[ 137.211442] [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[ 137.211447] [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[ 137.211451] [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[ 137.211456] [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[ 137.211460] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.211464] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.211468] [<ffffffff8109206e>] kthread+0xfe/0x120
[ 137.211473] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.211478] [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[ 137.211482] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.211485] ---[ end trace b39a62f95b1cad19 ]---
[ 137.211499] ------------[ cut here ]------------
[ 137.211506] kernel BUG at drivers/scsi/scsi_lib.c:1096!
[ 137.211511] invalid opcode: 0000 [#1] PREEMPT SMP
[ 137.211519] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[ 137.211630] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[ 137.211660] CPU: 0 PID: 391 Comm: kworker/0:1H Tainted: G W 4.4.0-rc3-test2+ #183
[ 137.211668] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[ 137.211675] Workqueue: kblockd bfq_kick_queue
[ 137.211681] task: ffff880214754a40 ti: ffff880214304000 task.ti: ffff880214304000
[ 137.211713] RIP: 0010:[<ffffffff81514ff2>] [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[ 137.211747] RSP: 0018:ffff880214307b90 EFLAGS: 00010097
[ 137.211777] RAX: 000000000000000e RBX: ffff8802147e68b0 RCX: 0000000000000006
[ 137.211807] RDX: 0000000000000024 RSI: 0000000000000000 RDI: 0000000000000009
[ 137.211838] RBP: ffff880214307ba8 R08: 0000000000000001 R09: 0000000000000000
[ 137.211868] R10: ffff8802147e68b0 R11: 0000000000000001 R12: ffff88003711c320
[ 137.211899] R13: 0000000000000000 R14: ffff8800370be000 R15: ffff8802147e68b0
[ 137.211930] FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
[ 137.211961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 137.211991] CR2: 000000000041bee0 CR3: 00000000d85ac000 CR4: 00000000001426f0
[ 137.212021] Stack:
[ 137.212048] ffff88003711c200 000000000003b000 0000000000000000 ffff880214307bf0
[ 137.212082] ffffffff81515045 ffff880214307c28 0000000000000092 ffff8802147e68b0
[ 137.212115] 000000000003b000 ffff8800d8a90800 ffff8800370be000 0000000000000020
[ 137.212148] Call Trace:
[ 137.212178] [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[ 137.212205] [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[ 137.212232] [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[ 137.212258] [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[ 137.212285] [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[ 137.212312] [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[ 137.212338] [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[ 137.212365] [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[ 137.212392] [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[ 137.212418] [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[ 137.212444] [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[ 137.212471] [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[ 137.212498] [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[ 137.212525] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.212552] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.212578] [<ffffffff8109206e>] kthread+0xfe/0x120
[ 137.212605] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.212631] [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[ 137.212657] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.212683] Code: 24 08 72 27 41 89 44 24 08 8b 43 5c 5b 41 89 44 24 10 44 89 e8 41 5c 41 5d 5d c3 41 bd 02 00 00 00 5b 44 89 e8 41 5c 41 5d 5d c3 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 137.212798] RIP [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[ 137.212826] RSP <ffff880214307b90>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/