Re: unfair io behaviour for high load interactive use still present in 2.6.31

From: Corrado Zoccolo
Date: Wed Sep 16 2009 - 03:29:16 EST


Hi Tobias,
On Tue, Sep 15, 2009 at 11:07 PM, Tobias Oetiker <tobi@xxxxxxxxxx> wrote:
> Today Corrado Zoccolo wrote:
>
>> On Tue, Sep 15, 2009 at 9:30 AM, Tobias Oetiker <tobi@xxxxxxxxxx> wrote:
>> > Below is an excerpt from iostat while the test is in full swing:
>> >
>> > * 2.6.31 (8 cpu x86_64, 24 GB Ram)
>> > * scheduler = cfq
>> > * iostat -m -x dm-5 5
>> > * running in parallel on 3 lvm logical volumes
>> >   on a single physical volume
>> >
>> > Device:         rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await  svctm  %util
>> > ---------------------------------------------------------------------------------------------------------
>> > dm-5              0.00    0.00   26.60  7036.60    0.10   27.49     8.00   669.04   91.58   0.09  66.96
>> > dm-5              0.00    0.00   63.80  2543.00    0.25    9.93     8.00  1383.63  539.28   0.36  94.64
>> > dm-5              0.00    0.00   78.00  5084.60    0.30   19.86     8.00  1007.41  195.12   0.15  77.36
>> > dm-5              0.00    0.00   44.00  5588.00    0.17   21.83     8.00   516.27   91.69   0.17  95.44
>> > dm-5              0.00    0.00    0.00  6014.20    0.00   23.49     8.00  1331.42   66.25   0.13  76.48
>> > dm-5              0.00    0.00   28.80  4491.40    0.11   17.54     8.00  1000.37  412.09   0.17  78.24
>> > dm-5              0.00    0.00   36.60  6486.40    0.14   25.34     8.00   765.12  128.07   0.11  72.16
>> > dm-5              0.00    0.00   33.40  5095.60    0.13   19.90     8.00   431.38   43.78   0.17  85.20
>> >
>> > for comparison, these are the numbers seen when running the test
>> > with just the writing enabled
>> >
>> > Device:         rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz     await  svctm  %util
>> > ---------------------------------------------------------------------------------------------------------
>> > dm-5              0.00    0.00    4.00 12047.40    0.02   47.06     8.00   989.81     79.55   0.07  79.20
>> > dm-5              0.00    0.00    3.40 12399.00    0.01   48.43     8.00   977.13     72.15   0.05  58.08
>> > dm-5              0.00    0.00    3.80 13130.00    0.01   51.29     8.00  1130.48     95.11   0.04  58.48
>> > dm-5              0.00    0.00    2.40  5109.20    0.01   19.96     8.00   427.75     47.41   0.16  79.92
>> > dm-5              0.00    0.00    3.20     0.00    0.01    0.00     8.00   290.33 148653.75 282.50  90.40
>> > dm-5              0.00    0.00    3.40  5103.00    0.01   19.93     8.00   168.75     33.06   0.13  67.84
>> >
>>
>> From the numbers, it appears that your controller (or the disks) is
>> lying about writes. They get cached (service time around 0.1ms) and
>> reported as completed. When the cache is full, or a flush is issued,
>> you get >200ms delay.
>> Reads, instead, have a more reasonable service time of around 2ms.
>>
>> Can you disable write cache, and see if the response times for writes
>> become reasonable?
>
> sure can ... the write-only results look like this. As you predicted,
> the wMB/s has gone down to more realistic levels ... but the await has also gone way up.
>
> Device:          rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz    await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-18              0.00    0.00    0.00  2566.80    0.00   10.03     8.00  2224.74   737.62   0.39 100.00
> dm-18              0.00    0.00    9.60   679.00    0.04    2.65     8.00   400.41  1029.73   1.35  92.80
> dm-18              0.00    0.00    0.00  2080.80    0.00    8.13     8.00   906.58   456.45   0.48 100.00
> dm-18              0.00    0.00    0.00  2349.20    0.00    9.18     8.00  1351.17   491.44   0.43 100.00
> dm-18              0.00    0.00    3.80   665.60    0.01    2.60     8.00   906.72  1098.75   1.39  93.20
> dm-18              0.00    0.00    0.00  1811.20    0.00    7.07     8.00  1008.23   725.34   0.55 100.00
> dm-18              0.00    0.00    0.00  2632.60    0.00   10.28     8.00  1651.18   640.61   0.38 100.00
>

Good.
The high await is normal for writes, especially since you have so many
queued requests.
Can you post the output of "grep -r . /sys/block/_device_/queue/" and
iostat for your real devices?
This should not affect reads, which will preempt writes under CFQ.
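For example, assuming the LVM physical volume sits on /dev/sda (substitute
your real device name), something like:

  grep -r . /sys/block/sda/queue/
  iostat -m -x sda 5

would show the queue settings and the per-request latencies as seen by the
physical device, rather than by the dm mapping on top of it.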

>
> in the read/write test the write rate stays down, but the rMB/s is even worse and the await is also way up,
> so I guess the bad performance cannot be blamed on the cache in the controller ...
>
> Device:          rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz    await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-18              0.00    0.00    0.00  1225.80    0.00    4.79     8.00  1050.49   807.38   0.82 100.00
> dm-18              0.00    0.00    0.00  1721.80    0.00    6.73     8.00  1823.67   807.20   0.58 100.00
> dm-18              0.00    0.00    0.00  1128.00    0.00    4.41     8.00   617.94   832.52   0.89 100.00
> dm-18              0.00    0.00    0.00   838.80    0.00    3.28     8.00   873.04  1056.37   1.19 100.00
> dm-18              0.00    0.00   39.60   347.80    0.15    1.36     8.00   590.27  1880.05   2.57  99.68
> dm-18              0.00    0.00    0.00  1626.00    0.00    6.35     8.00   983.85   452.72   0.62 100.00
> dm-18              0.00    0.00    0.00  1117.00    0.00    4.36     8.00  1047.16  1117.78   0.90 100.00
> dm-18              0.00    0.00    0.00  1319.20    0.00    5.15     8.00   840.64   573.39   0.76 100.00
>

This is interesting. It seems no reads are happening at all here.
I suspect that this, together with your observation that the real read
throughput is much higher, can be explained by your readers mostly
reading from the page cache.
Can you describe in more detail how the test workload is structured,
especially with regard to caching?
How do you ensure that some tars are just readers and some are just
writers? I suppose you pre-fault in the input for the writers and
direct the output to /dev/null for the readers? Are all readers reading
from different directories?
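If the readers are indeed being served from the page cache, dropping the
caches between runs should make the read side show up in iostat again. A
minimal sketch (this is the standard procfs knob, nothing specific to your
setup; run as root before starting the readers):

  sync
  echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes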

A few things to check (see the sample commands after this list):
* are the CPUs saturated during the test?
* are the readers mostly in state 'D', 'S', or 'R'?
* did you try the 'as' (anticipatory) I/O scheduler?
* how big are your volumes?
* what is the average load on the system?
* with the I/O controller patches, what happens if you put the readers in one
domain and the writers in the other?
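For the process-state and load questions, something along these lines should
be enough (the '[t]ar' pattern is just a trick to keep grep from matching
itself; adjust it to however your test processes are named):

  ps -eo stat,pid,args | grep '[t]ar'   # D = uninterruptible sleep (waiting on I/O)
  cat /proc/loadavg                     # 1/5/15 minute load averages
  vmstat 5                              # run queue, blocked processes and CPU usage over time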

Are you willing to test some patches? I'm working on patches to reduce
read latency that may be of interest to you.

Corrado

> cheers
> tobi
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch tobi@xxxxxxxxxx ++41 62 775 9902 / sb: -9900



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo@xxxxxxxxx
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------