Re: I/O request ordering

Stephen C. Tweedie (
Fri, 9 Aug 1996 00:11:20 +0100


[Long, but there's a lot to talk about...]

The introduction:

On Wed, 7 Aug 96 20:20:09 METDST, Molnar Ingo
<> said:

> On Tue, 6 Aug 1996, Stephen C. Tweedie wrote:
>> On Tue, 6 Aug 1996 11:55:46 +0300 (EET DST), Linus Torvalds
>> <> said:
>> > On Mon, 5 Aug 1996, Ray Van Tassle-CRV004 wrote:
>> >> For some time, I have had doubts about the I/O request ordering
>> >> algorithm in drivers/block/ll_blk_rw. ...
>> > If you _really_ want to improve performance, you should probably
>> > not use an elevator. Instead, you should use something like
>> > "saw-tooth with a window" ...
>> ... There is a major problem with this sort of window, which is
>> that fairness is sacrificed.

> As addition, and as possible solution: a small theoretical analysis and
> a proposed pseudocode algorithm.

Molnar proposes a solution. Briefly:

> Lets consider the following model for the disk-IO problem:

> We have a statistical system, where 'requests' get added, and we
> have to keep 'head movement' low. Requests can be reordered. Each
> request has a 'position' and a 'size'. We have a boundary condition:
> we have to process each request within a predefined amount of time.

That's the starvation criterion. It's not the only one; we also want
both fairness and throughput.

> Our task is to create dynamics for this system (an algorithm), which
> satisfy the boudary condition and keep the 'head movement' function
> at a minimum.

> Even this simple model shows many interesting things:

> Best would be if we waited forever, to get the lowest value for the
> 'head movement' function.

We can't. This is the biggest problem with the whole issue. Trouble
is, the kernel is simply not presented with a list of IO requests and
told to go service them; it is issued with one request, and only once
that request has been serviced does it get the next request from the
same process. It's a real bummer, and it means that there is a limit
to what you can achieve just by reordering requests. You really want
to _predict_ requests, as far into the future as possible. This is
one of the reasons why good sequential readahead is so important, for

There are some perhaps counter-intuitive results in the literature
worth looking at. Ruemmler and Wilkes [1] performed a 4-month long
observation of ALL disk accesses on an HP/UX machine and found that
there was really very little to be gained by moving from a scan
algorithm to ssf (shortest seek first), largely because read requests
are indeed synchronous and so not particularly tractable, and write
requests tend to be batched pretty well anyway.

Seltzer, Chen and Ousterhout [2] simulated a number of possible
algorithms, including scan and ssf, but also found that scan returns
disk utilisations "essentially identical" with ssf. However, scan
substantially improves the maximum observed response time; ssf has a
much larger variance in response times due to its inherently unfair
nature, as expected. They propose a new scheduler, "shortest time
first", which takes into account not only seek times but rotational
delays, and which provides significant improvements in utilisation for
very large request queue lengths; however, it is much more CPU
intensive to calculate.

In both studies, improvements to performance came about primarily as a
result of optimising large batches of write requests. The synchronous
nature of read requests meant that there was very little improvement
seen in read performance as a result of scheduling changes.

> A 'good method' is to 'simulate' an 'infinite request queue'. This
> can be achieved by grouping requests into intervals, and processing
> each interval as if they were independent of each other. An
> independent interval should be processed by using the best route
> method, this is simply the elevator algorithm over a static request
> queue.

This will harm performance; we will end up skipping the head over
requests in the "other" queue, resulting in up to twice the number of
passes over the disk surface to service the requests.

> The sawtooth method breaks the boundary condition, it's not
> guaranteed that all requests get processed.

Yes it is. It is also fair; requests in the middle of a large batch,
or towards one end of the disk, are not unfairly penalised or
advantaged by the single-direction scan scheduler.

On Thu, 8 Aug 1996 9:18:41 -0500, Ray Van Tassle-CRV004
<> said:

> Interesting comments and analyses--just what I was hoping for. Usenet at
> it's finest.

> As Linus (and others) brought to light, the killer for a true elevator
> algorithm is a burst of requests to the same area of the disk, but not in
> the same direction that the elevator (or sawtooth) is moving.

No, that's really not much of a problem, actually. In this scenario
you have two typical cases: either all of the requests are submitted
at once, in which case the worst that happens is you start halfway
through them, continue in order forward through the disk and then
again from the start; or they are submitted one by one, in which case
you can do no better than first-come-first-served.

Remember, this type of "batched" request is typically encountered for
writes, not reads, and there is not the same urgency about servicing
write requests as there is for read requests. Maintaining high disk
throughput is MUCH more important in this case.

> You'd really like to have shortest-latency first for these. The
> trick is knowing when to _stop_, so that others don't starve.

No, as I've said, starvation isn't that much of a problem for writes.

> Other than this, a true elevator would seem to give the best mix between
> short latency and fairness.

Very, very true. That's exactly why I am uncertain about the merits
of trying to fiddle too much with the existing device-level scheduler.
Much better would be to try to improve its implementation to reduce
the CPU overhead, for example, or to try to tweak the buffer-level
sync() write scheduler.

> Some of the ideas that people tossed in were WAAAAAAAAY too complicated, but
> did give me an idea which should be simple to implement and avoid the
> possibility of starving some requests:
> 1) real elevator-ordering, but
> 2) if a new request is within X sectors (or cylinders) of the CURRENT
> request, insert it near the front. IOW, there would be a small clustering
> of new requests (ordered perhaps by smallest distance between requests (i.e.,
> shortest latency)) right after the current request. X would be chosen to
> represent a seek of half-a-dozen cylinders or less.
> This would catch a flurry of I/O to a small area--such as loading an
> executable--no matter which way the overall elevator was going.

No, it won't!!!!! This is the whole problem!!! If you have got a
pile of outstanding requests, then what happens when you load an
executable is that you get a page fault; one single page is requested;
once that request is serviced, the application can continue to run,
but in the mean time there are no more requests from that process, so
we end up seeking somewhere else on the disk; and later, the
application references another page, so we're right back to square
one, having to do another seek to the executable binary.

You can only improve this by either deferring resumption of the
request queue for a clock tick or so, to allow the application to
request another IO; or by trying to predict the application's future
behaviour and to page in a bit more than necessary when a page fault
occurs. I'll be doing some form of the latter in the next kswap
patches, but I'm reluctant to do the former, since there are all sorts
of unpleasant fairness implications and we can end up seriously
increasing device latency when (not if) we get the prediction wrong.

> Yes, I can't stress enough that this must be THROUGHLY tested and
> benchmarked.

Heheheh. You ain't kidding here.

> My thoughts here are to do all these:
> 1) make dep
> 2) make clean zImage
> 3) umsperf
> 4) Bonnie
> 5) multiple greps (as suggested by Linus)
> 6) untar the kernel
> 7) emacs a large file, and immediately quit
> All these should be run 5-10 times. Average time should be better than
> current sawtooth, and there should not be any large (i.e., bad) peak.

I disagree. Maybe I'm just a pessimist, but I really think you need
to implement it, throw it at a thousand odd beta testers for
evaluation, put it into the current development kernel, play with it
for a dozen releases while you get lots and lots of negative feedback,
tweak it all the time, then throw it out. :)

People HAVE tried to improve IO request scheduling in the past. The
conclusions seem to be that yes, you can improve things at the device
request layer IF you know exactly what the physical characteristics of
the disk --- cylinder format, seek timings, command setup latencies
and rotation speed --- are. Otherwise, you are much better off trying
to optimise the parts of the system which are generating the requests
in the first place, by introducing techniques such as the
log-structured filesystem (but that's another can of worms), adaptive
readahead and by optimising the requesting of buffer writes in the
first place.


Stephen Tweedie <>
Department of Computer Science, Edinburgh University, Scotland.

1] Chris Ruemmler and John Wilkes, "UNIX Disk Access Pattern". HP Laboratories Technical Report #HPL-92-152, December 1992. Published in Proceedings of the Usenix Winter 1993 Technical Conference.

2] Margo Seltzer, Peter Chen and John Ousterhout, "Disk Scheduling Revisited". Computer Science Division, UCB.