what's the semaphore in requests for?

From: Peter T. Breuer (ptb@it.uc3m.es)
Date: Mon Jul 23 2001 - 18:39:33 EST

What's the semaphore field in requests for? Are driver writers supposed
to be using it?

The reason I ask is that I've been chasing an smp bug in a block driver
of mine for a week. The bug only shows up in 2.4 kernels (not in the same
code under 2.2.18) and only with smp ("nosmp" squashes it). It only
shows up when running dd in user space copying from my device to
a disk device. It doesn't show when copying to /dev/null.

The symptom is a complete kernel lockup. Not even SysRq works.
It's driving me crazy. It seems to have become very easy to trigger in 2.4.6,
while it was hard or impossible to trigger back in 2.4.0 and 2.4.1.

I have added the sgi kdb stuff in order to get a handle. For a while I
was getting some ouches from the nmi watchdog saying that one cpu was
locked, followed by a jump into the kdb monitor. But I'm not getting that
now. In any case I haven't learned how to use kdb properly yet, so
I couldn't make out much from the stack info.

The bug may also show on writes from a local disk to the device, but
it's at least 10 times as hard to trigger that way. It does NOT trigger
when writing to the device from /dev/zero. I'm not sure it shows on all
my smp machines either; most of them have been slightly unstable
under 2.4.* anyway, locking up on timescales of 1 day to a week. Could
be apic (asus and dell bx), but I was running my own machine noapic and
it didn't affect the bug.

The block driver is largely in userspace. All the kernel half does
is transfer requests to a local queue (with the io lock still held, of
course). The userspace daemon cycles continuously doing ioctls that
copy the requests (bh by bh) into userspace, where they are handled via
some networking calls, and then returns an ack via another ioctl.

The driver's local queue is protected by a semaphore. The thing that
puzzles me is that the bug shows only when copying to a disk device,
not to /dev/null, through userspace! Is it that the lifetime of a
request is much longer than expected?
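
To make the layout concrete, here is a rough userspace model of the
hand-off described above. It is Python rather than kernel C, and it is
only a sketch: the class, method and field names are all invented for
illustration, an ordinary lock stands in for the io lock, and the order
in which the two locks are taken is an assumption. It does not reproduce
the lockup. It only shows that a request stays alive from the moment it
reaches the local queue until the daemon acks it.

```python
import threading
from collections import deque


class LocalQueueModel:
    """Userspace model of the request hand-off; every name is hypothetical."""

    def __init__(self):
        self.io_lock = threading.Lock()          # stands in for the io lock
        self.queue_sem = threading.Semaphore(1)  # stands in for the driver's semaphore
        self.block_queue = deque()  # requests handed to the driver
        self.local_queue = deque()  # driver's own queue, waiting for the daemon
        self.in_flight = {}         # fetched by the daemon, not yet acked
        self.completed = []         # request ids finished after an ack

    def submit(self, req_id, buffers):
        """Queue a request for the driver, as the block layer would."""
        with self.io_lock:
            self.block_queue.append((req_id, list(buffers)))

    def request_fn(self):
        """Kernel half: move every pending request to the local queue."""
        with self.io_lock:          # io lock still held during the transfer
            with self.queue_sem:    # assumed nesting, not confirmed by the post
                while self.block_queue:
                    self.local_queue.append(self.block_queue.popleft())

    def ioctl_get(self):
        """Daemon side: fetch the next request, or None if none is queued."""
        with self.queue_sem:
            if not self.local_queue:
                return None
            req_id, buffers = self.local_queue.popleft()
            self.in_flight[req_id] = buffers   # still alive until acked
            return req_id, list(buffers)       # copied out buffer by buffer

    def ioctl_ack(self, req_id):
        """Daemon side: acknowledge a request after the network round trip."""
        with self.queue_sem:
            self.in_flight.pop(req_id)
        with self.io_lock:
            self.completed.append(req_id)
```

In this model a request sits in in_flight for as long as the daemon
takes between ioctl_get() and ioctl_ack(), which is the lifetime the
question above is about.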

I have some impression that the bug is dependent on speed too. If I
limit the speed of the device, I think I don't see the bug - but
definitive results are very hard to come by because I have to copy
about 2GB from the device to be sure of triggering it.

Oh well, if anyone has any insight or any plans for further hunting,
please let me know.
