Re: Strange write starvation on 2.6.17 (and other) kernels

From: Neil Brown
Date: Tue Aug 22 2006 - 03:38:27 EST

On Monday August 21, szymans@xxxxxxxxxx wrote:
> Neil Brown wrote:
> > On Thursday August 17, miquels@xxxxxxxxxx wrote:
> >
> > Can you report the contents of /proc/meminfo before, during, and after
> > the long pause?
> > I'm particularly interested in MemTotal, Dirty, and Writeback, but the
> > others are of interest too.
> >
> > Thanks,
> > NeilBrown
> I've prepared two logs, from different machines, the first one is on
> software RAID5 (4 ATA disks) with deadline scheduler, the second on a
> single ATA disk with CFQ scheduler. In the first case 10 writer threads
> are sufficient to give large delays, in the second case I've run an
> additional tar thread reading from the disk.
> The logs are here:
> Each writer starts with:
> Writing 200 MB to stdout without fsync
> than reports each write() that lasts > 3s along with pid:
> 6582 - Delayed 4806 ms.
> And finishes with:
> Max write delay: 14968 ms.
> Andrzej.

I spent altogether too long exploring this :-)

I think all that is happening here is that the delay that has to be
imposed on the different write calls is being imposed somewhat
unevenly.

Suppose your filesystem/device can sustain 15MB/sec (like my
Then with 10 concurrent threads, each should expect to write at
1.5MB/sec, or 1 Megabyte every 667 milliseconds.

So you might expect each 1Meg write to take 667msec.
However, Linux only imposes write throttling every 4Meg (if you have
128Meg of RAM or more). So you would expect 3 out of 4 requests to be
instantaneous, and 1 out of 4 to take 2.667 seconds.
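
The arithmetic behind those figures can be sketched as follows. The
15MB/sec, 10 writers, 1Meg writes and 4Meg throttling interval come from
the text above; the function and variable names are only illustrative.

```python
def per_writer_rate(device_mb_per_sec, writers):
    """Fair share of the device bandwidth for one writer, in MB/sec."""
    return device_mb_per_sec / writers

def seconds_for(megabytes, rate_mb_per_sec):
    """Time needed to push `megabytes` at the given rate."""
    return megabytes / rate_mb_per_sec

rate = per_writer_rate(15.0, 10)        # each writer's share: 1.5 MB/sec
per_write = seconds_for(1.0, rate)      # evenly spread delay per 1Meg write
per_throttle = seconds_for(4.0, rate)   # delay if charged once per 4Meg

print(f"share per writer : {rate:.2f} MB/sec")
print(f"per 1Meg write   : {per_write * 1000:.0f} ms")
print(f"per 4Meg throttle: {per_throttle:.3f} s (1 write in 4)")
```

The same total delay is owed either way; charging it once per 4Meg just
concentrates it on every fourth write.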

However it is very hard to impose rate limiting completely uniformly.
For example the block device driver imposes a queue length limit. When
space becomes available on the queue, it is fairly random which
process gets to use it.
Further the block layer actually imposes some unfairness: when a
process gets an entry on the queue, it is (almost) guaranteed another
31, so it will make lots of progress at the expense of anyone else.
It does this to improve throughput (I believe).
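
A toy model (not kernel code) of what that batching does to waiting
times. It assumes queue slots are handed out strictly round-robin, which
is kinder than the "fairly random" selection described above, so real
gaps can be longer. The batch of 32 is the one entry plus "another 31";
the function name and the slot counts are made up for illustration.

```python
def longest_gap(writers, batch, total_slots):
    """Hand out queue slots round-robin in runs of `batch` and return the
    longest run of consecutive slots in which writer 0 gets nothing."""
    longest = current = 0
    for slot in range(total_slots):
        owner = (slot // batch) % writers
        if owner == 0:
            current = 0               # writer 0 was served, gap ends
        else:
            current += 1              # someone else took this slot
            longest = max(longest, current)
    return longest

unbatched = longest_gap(writers=10, batch=1, total_slots=3200)
batched = longest_gap(writers=10, batch=32, total_slots=3200)
print("slots waited without batching:", unbatched)
print("slots waited with batches of 32:", batched)
```

Even with perfectly fair turn-taking, a writer that has just missed its
turn sits behind nine full batches rather than nine single requests.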

So getting individual 1Meg writes taking a few seconds is quite likely.
Having a few take several seconds should be expected.
You report numbers up to 26 seconds. That is clearly quite a lot, but
I'm not sure there is much to be done about it - imposing strict
fairness will always have a cost. That 26 seconds was in a run where
there were 2000 writes. You would expect 500 to hit delays. Only 63
hit delays > 3 seconds.
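
For reference, the counts above work out like this. The 2000 writes and
the 63 slow ones are taken from the report; "one write in four is
throttled" is the 4Meg interval divided by the 1Meg write size. Variable
names are illustrative.

```python
total_writes = 2000        # 1Meg writes in the run, from the report
throttle_every = 4         # writes per throttling point (4Meg / 1Meg)
slow_writes = 63           # writes reported as taking > 3 seconds

expected_throttled = total_writes // throttle_every
share_of_all = slow_writes / total_writes
share_of_throttled = slow_writes / expected_throttled

print("writes expected to hit a delay:", expected_throttled)
print(f"slow writes as share of all writes: {share_of_all:.2%}")
print(f"slow writes as share of throttled : {share_of_throttled:.1%}")
```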

In my various experiments, the one thing that was effective in
improving the fairness was to make Linux impose write throttling more
often.
In mm/page-writeback.c, in set_ratelimit() where ratelimit_pages is
calculated, cap it much lower, e.g. set it unconditionally to 16.

This might increase CPU load on multi-processor machines, but it does
seem to increase the fairness of write throttling.
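
A rough model of that calculation, to show the scale of the change. This
is not the kernel source: the 4Meg ceiling and the value 16 come from the
text above, while the 4 KiB page size, the divisor of 32 per CPU and the
floor of 16 pages are assumptions recalled from kernels of that era and
should be checked against the actual tree. The `forced` parameter and
`stall_seconds` helper are hypothetical.

```python
PAGE_SIZE = 4096            # assumption: 4 KiB pages
CAP_BYTES = 4096 * 1024     # the 4Meg ceiling described above

def ratelimit_pages(total_pages, online_cpus, forced=None):
    """Pages a task may dirty between throttling checks (model only)."""
    if forced is not None:                     # the experiment: fixed value
        return forced
    pages = total_pages // (online_cpus * 32)  # assumed divisor
    pages = max(pages, 16)                     # assumed floor
    return min(pages, CAP_BYTES // PAGE_SIZE)  # never more than 4Meg

def stall_seconds(pages, rate_mb_per_sec):
    """Stall per check needed to hold a writer to its share."""
    return pages * PAGE_SIZE / (rate_mb_per_sec * 1024 * 1024)

ram_128meg = 128 * 1024 * 1024 // PAGE_SIZE    # 128Meg of RAM, in pages
default = ratelimit_pages(ram_128meg, 1)
lowered = ratelimit_pages(ram_128meg, 1, forced=16)

for label, pages in (("default", default), ("lowered", lowered)):
    kib = pages * PAGE_SIZE // 1024
    print(f"{label}: {pages} pages ({kib} KiB), "
          f"stall ~{stall_seconds(pages, 1.5):.3f} s at 1.5MB/sec")
```

Under these assumptions the per-check stall at a 1.5MB/sec share drops
from about 2.7 seconds to about 40 milliseconds, at the price of checking
64 times as often.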
