Re: [PATCH] Bio Traversal Changes

From: James Bottomley (
Date: Mon Aug 05 2002 - 10:47:39 EST said:
> There is only one call to ->request_fn for the entire request, and the
> drivers manages things underneath. The chunks are expected to complete
> sequentially. In the situation where the request is restarted in the
> event of an error (say), the submission pointers are rolled back to
> the last (successfully) completed point before issuing the request
> again.

Yes, that's the way I thought it would operate. said:
> I must say that I initially did think that this could be extended to
> the more generic case which you probably are referring to and that
> such an approach could take away the need to split bios in certain
> cases (i.e. when the i/o is destined for a single queue). Later it
> appeared that trying to cover the case where each of these pieces
> gets queued up and might complete out of order (requiring a tag to
> correlate things on completion), would most likely boil down to
> trying to maintain all the state that struct request does today.

For this more generic case, most of our problems seem to be because the
barrier has width: It actually belongs to an I/O request. If the barrier had
zero width (i.e. it was simply a barrier in the stream with no I/O attached)
then it would be much easier to preserve it correctly across this (or any
other) type of bio splitting. It would also make it much more obvious to the
implementing driver where the barrier was supposed to be in the I/O stream,
and would allow more efficient "wait for completion" barrier implementations
for drivers that couldn't enforce it any other way.

> Would be nice (for me) to understand this in more detail. There might
> be some possibilities. Any pointers that I can look up to get a
> clearer idea ?

The SCSI standards ( are the only real authoritative source (with
even some explanation). However, I'll do my best to summarise.

In SCSI, commands are allowed to disconnect, that is suspend temporarily while
the device does other things. When the device implements tag command
queueing, it is allowed to disconnect one command and subsequently reconnect
(restart) a different one. In theory, this means that we can have multiple
active I/Os at once. The way you signal to the scsi device that you want a
barrier is to label one or more of the tags as "ordered" which means that the
device must complete all I/O of tags prior to the ordered one before it and
may not begin I/O of subsequent tags until the ordered tag has completed.

looping a single request over a big bio means that the SCSI device sees the
I/O as a discrete stream of tags. However, we lose throughput if we stall the
queue waiting for this single bio to complete and we can't work out what the
next tag is until the prior tag completes. In the non barrier case,
everything will still be OK as long as the queue isn't stalled because we'll
be getting throughput from other bios coming down.

I think basically, I'd like to translate as much of the bio as I can into SCSI
tags to improve throughput and each tag currently requires a struct request.

> Does completion notification happen only when all the commands
> covered by a single tag complete ? Otherwise, what is the ordering
> amongst the multiple commands in question (do they complete in serial
> order as well) ?

Yes and no. You get a special completion code (INTERMEDIATE_TASK_COMPLETE)
which says "I've finished this bit, give me the next part". You don't get a
real SCSI completion until the last part of the linked task set completes.
The task is linked sequentially, so it does complete in serial order.

However, Don't worry about the linked task stuff, it's a rather esoteric area
of the SCSI standard (that allows a single tag to be used across multiple I/Os
in very much the same way the bio splitting works) which, on mature
reflection, probably isn't such a good idea to use since I'd be doubtful about
how well it's implemented in the devices we have to deal with.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
More majordomo info at
Please read the FAQ at

This archive was generated by hypermail 2b29 : Wed Aug 07 2002 - 22:00:28 EST