Soft lockups when using an SCSI tape device

From: Willy Tarreau
Date: Tue Sep 22 2009 - 06:04:38 EST


Hello,

at work we've been bothered for a while with a backup tool
triggering kernel panics. The machine is a 64-bit Core2Duo,
it runs CentOS 5.x with an updated kernel (right now we're
on a slightly patched 2.6.27.29), but many kernels since
2.6.22 have been showing the same issue.

As it happened today while I was here, I took a photo of the
panic and transcribed it. Here it is:

INFO: task mt: 22922 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
mt D 0000000000000000 0 22922 22917
ffff880003a99c88 0000000000000082 0000000000000000 ffff880031a40f00
ffff880003a58000 ffff88007f862d00 ffff880003a58230 ffffffff80829000
ffffffff8082f780 000000013f3aec8a 000000000000000f ffffffff804e8af2
Call Trace:
[<ffffffff804e8af2>] scsi_request_fn+0x222/0x350
[<ffffffff80629105>] schedule_timeout+0x95/0xd0
[<ffffffff804e7ed0>] scsi_execute_async+0x2f0/0x3c0
[<ffffffff806286a5>] wait_for_common+0xa5/0x160
[<ffffffff80233890>] default_wake_function+0x0/0x10
[<ffffffff80505b76>] st_do_scsi+0x1f6/0x2c0
[<ffffffff80505260>] st_sleep_done+0x0/0x90
[<ffffffff80507719>] do_load_unload+0xb9/0x180
[<ffffffff8050a571>] st_ioctl+0x941/0x10e0
[<ffffffff80283a44>] handle_mm_fault+0x234/0x740
[<ffffffff802a988f>] vfs_ioctl+0x2f/0xa0
[<ffffffff802a996f>] do_vfs_ioctl+0x6f/0x2b0
[<ffffffff802a9c41>] sys_ioctl+0x91/0xb0
[<ffffffff8020c28b>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: softlockup: blocked tasks
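
Each entry in the trace above has the form "[<address>] symbol+0xoffset/0xsize",
so the start address of a function can be recovered by subtracting the offset
from the address, which helps when matching an entry against a disassembly.
A minimal Python sketch of that arithmetic follows; the names TraceEntry and
parse_trace_line are hypothetical helpers for illustration, not part of any
kernel tooling:

```python
import re
from typing import NamedTuple, Optional


class TraceEntry(NamedTuple):
    """One "[<address>] symbol+0xoffset/0xsize" entry (hypothetical helper type)."""
    address: int   # return address as printed in the trace
    symbol: str    # function name
    offset: int    # offset of the address inside the function
    size: int      # total size of the function in bytes

    @property
    def function_start(self) -> int:
        # Start of the function = printed address minus the offset into it.
        return self.address - self.offset


_TRACE_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_trace_line(line: str) -> Optional[TraceEntry]:
    """Parse one call-trace line; return None if the line is not a trace entry."""
    match = _TRACE_RE.search(line)
    if match is None:
        return None
    return TraceEntry(
        address=int(match["addr"], 16),
        symbol=match["sym"],
        offset=int(match["off"], 16),
        size=int(match["size"], 16),
    )


entry = parse_trace_line("[<ffffffff804e8af2>] scsi_request_fn+0x222/0x350")
print(entry.symbol, hex(entry.function_start))
```

Feeding the first trace line to it gives the start of scsi_request_fn, to
which the 0x222 offset can then be applied in a disassembly of scsi_lib.o.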

It's important to note that the tape was ejected; the panic apparently
occurred on return from the mt eject command.

# uname -a
Linux carbone.exosec.local 2.6.27-wt9-carbone #1 SMP Mon Aug 3 09:50:14 CEST 2009 x86_64 x86_64 x86_64 GNU/Linux

We have SOFTLOCKUP enabled:
CONFIG_DEBUG_KERNEL=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
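
The build-time options above only set defaults; the values in effect on a
running machine can be read back from /proc/sys/kernel. A rough Python sketch
follows. hung_task_timeout_secs is the file named in the message above;
softlockup_panic is assumed here to be the runtime counterpart of
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE and may not exist on every kernel, in
which case it is reported as unavailable. The function name
read_lockup_sysctls is hypothetical:

```python
from pathlib import Path
from typing import Dict, Optional

# hung_task_timeout_secs appears in the kernel message itself;
# softlockup_panic is an assumption about this kernel's sysctl set.
SYSCTL_NAMES = ("hung_task_timeout_secs", "softlockup_panic")


def read_lockup_sysctls(base: str = "/proc/sys/kernel") -> Dict[str, Optional[int]]:
    """Return the current value of each sysctl, or None if it cannot be read."""
    values: Dict[str, Optional[int]] = {}
    for name in SYSCTL_NAMES:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file, no permission, or unexpected content.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_lockup_sysctls().items():
        print(f"{name}: {'unavailable' if value is None else value}")
```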

I have found the exact place where the lock is held. It's in
drivers/scsi/scsi_lib.c:scsi_request_fn() line 1603:

1600 out:
1601 /* must be careful here...if we trigger the ->remove() function
1602 * we cannot be holding the q lock */
>1603< spin_unlock_irq(q->queue_lock);
1604 put_device(&sdev->sdev_gendev);
1605 spin_lock_irq(q->queue_lock);
1606 }
1607

As I understand it, someone else holds the queue lock. Note
that I also have CONFIG_TRACE_IRQFLAGS_SUPPORT=y, and I must
admit that I got lost in the tentacles of the macros and
inlines called from spin_unlock_irq(). I don't have PREEMPT
though.

I have reviewed the changes to st.c since this kernel and do
not see anything obviously relevant. I've found a few apparently
similar issues on the net, one of which is here:

http://article.gmane.org/gmane.linux.debian.devel.bugs.general/613223

I don't know where to look right now. I'd like some advice,
maybe some options to pass to the kernel at boot, or some config
options to change (as long as they don't affect performance
much or require frequent reboots, since it's a production
server).

I can send the full config if needed, although I'm not sure it
would help.

Thanks in advance,
Willy
