Re: elevator-starvation-4 (2.2.14 && 2.3.42)

From: Andrea Arcangeli (andrea@suse.de)
Date: Fri Feb 11 2000 - 11:19:46 EST


On Fri, 11 Feb 2000, Bruce Thompson wrote:

> It need not be a case of "The driver can handle X MB/sec so give
>the other one X/2 MB/sec of requesting ability". One thought that

Well, it's not so trivial to know that the driver runs at X MB/sec. You
would also have to take into account the seeks and the underlying LVM
physical volumes.

>process. Once the counter exceeds some particular highwater mark the
>process is blocked from entering more requests until the counter
>drops to some lowwater mark. Hmm. Another thought. Make the counter

The tasks never generate the write I/O themselves. It is kupdate that
flushes the writes to disk, once the buffer is too old. By that time the
process may have already exited, and the user may have just run another
program that gets stalled by the write flood.

Also, kupdate feeds in one chunk, at maximal speed, all the requests we
got in the last 5 seconds, even if the task wrote at a lower frequency.
That looks fine for CPU icache optimizations, and for coalescing and
elevator-sorting the requests to reduce the global load.
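
To illustrate the batching described above, here is a minimal sketch in
Python. It is a toy model, not kupdate's code: the function name
flush_batches, the (time, block) event format and the fixed wakeup grid
are assumptions made for the example. Only the 5-second interval comes
from the text above.

```python
import math
from collections import defaultdict


def flush_batches(dirty_events, interval=5.0):
    """Group dirtied blocks by the periodic wakeup that would flush them.

    dirty_events: iterable of (time_in_seconds, block_number) pairs.
    Returns {wakeup_time: [block numbers sorted ascending]}.
    Toy model (assumption): every block dirtied since the previous
    wakeup is submitted in one sorted batch at the next wakeup.
    """
    batches = defaultdict(list)
    for t, block in dirty_events:
        # A block dirtied at time t is picked up by the first wakeup >= t.
        wakeup = math.ceil(t / interval) * interval
        batches[wakeup].append(block)
    # Sorting each batch stands in for the elevator ordering of the chunk.
    return {w: sorted(blocks) for w, blocks in sorted(batches.items())}


# A slow writer: one block dirtied every 0.5 seconds for 10 seconds.
events = [(0.5 * i, (i * 37) % 100) for i in range(1, 21)]
batches = flush_batches(events)
for wakeup, blocks in batches.items():
    print(wakeup, len(blocks), blocks)
```

Even though the writer dirties only two blocks per second, the device
sees them as bursts of ten sorted requests, one burst per wakeup.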

> Either way, it's an interesting problem, and I truly do believe
>that attacking it via mods to the elevator algorithm is merely
>putting off the inevitable whereas attacking the root cause will
>allow us to degrade much more gracefully in the face of massive I/O
>activity.

The starvation problem can be reproduced perfectly well with 30000 tasks
that write 1 request per task, each of them writing at 1/30000 of the
max throughput that the I/O can handle. But the 30001st task gets starved
for two days. See? The bug is definitely in the elevator, and I think the
best way to fix it is with a per-request maximal latency, as I just did.
Trying to hide it in the other layers is wrong IMHO.
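
The starvation and the effect of a per-request latency bound can be
sketched with a toy queue in Python. This is not the code of the patch:
the names Request, passes_left, insert_sorted and ticks_until_served, and
the rule "a request may be passed at most max_passes times", are
assumptions chosen only to illustrate the idea.

```python
import math
from dataclasses import dataclass

VICTIM_SECTOR = 1_000_000  # hypothetical far-away sector of the starved request


@dataclass
class Request:
    sector: int
    passes_left: float  # how many newer requests may still be queued ahead of it


def insert_sorted(queue, sector, max_passes):
    """Insert in ascending sector order, walking back from the tail.

    The walk stops at a request whose passes_left has reached 0, so such
    a request can no longer be overtaken. max_passes=math.inf models the
    plain sorted elevator with no latency bound.
    """
    pos = len(queue)
    while pos > 0:
        ahead = queue[pos - 1]
        if ahead.sector <= sector or ahead.passes_left == 0:
            break
        pos -= 1
    # Every request that ends up behind the new one has been passed once.
    for passed in queue[pos:]:
        passed.passes_left -= 1
    queue.insert(pos, Request(sector, max_passes))


def ticks_until_served(max_passes, limit=10_000):
    """Return the tick at which the victim is served, or None if it is
    still waiting after `limit` ticks."""
    queue = []
    insert_sorted(queue, VICTIM_SECTOR, max_passes)
    for tick in range(1, limit + 1):
        # One competing write at a low sector arrives per tick ...
        insert_sorted(queue, tick, max_passes)
        # ... and the device completes exactly one request per tick.
        served = queue.pop(0)
        if served.sector == VICTIM_SECTOR:
            return tick
    return None


print(ticks_until_served(math.inf))  # unbounded: victim is never served
print(ticks_until_served(4))         # bounded: victim served once it cannot be passed
```

In this model the writers never exceed the device throughput, yet
without the bound the victim waits forever, because every new request
sorts ahead of it. With the bound, its wait is limited by max_passes no
matter how many writers there are.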

I believe the theory we studied was missing the basic point I got in the
last weeks. It looked correct to me too at first. But then I understood
that it simply doesn't apply in practice, so I designed a way to enforce
a max latency per request, and as far as I can tell the longstanding
problem is now finally fixed in practice.

BTW the elevator bug is not only a latency issue, it's also a reliability
issue: a write request can be delayed for too long as well.

Andrea
