Re: unfair io behaviour for high load interactive use still present in 2.6.31

From: Tobias Oetiker
Date: Wed Sep 16 2009 - 03:54:22 EST


Hi Corrado,

Today Corrado Zoccolo wrote:

> Hi Tobias,
> On Tue, Sep 15, 2009 at 11:07 PM, Tobias Oetiker <tobi@xxxxxxxxxx> wrote:
> >
> > Device:          rrqm/s   wrqm/s     r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  svctm  %util
> > ---------------------------------------------------------------------------------------------------------
> > dm-18              0.00     0.00    0.00  2566.80     0.00    10.03     8.00  2224.74   737.62   0.39 100.00
> > dm-18              0.00     0.00    9.60   679.00     0.04     2.65     8.00   400.41  1029.73   1.35  92.80
> > dm-18              0.00     0.00    0.00  2080.80     0.00     8.13     8.00   906.58   456.45   0.48 100.00
> > dm-18              0.00     0.00    0.00  2349.20     0.00     9.18     8.00  1351.17   491.44   0.43 100.00
> > dm-18              0.00     0.00    3.80   665.60     0.01     2.60     8.00   906.72  1098.75   1.39  93.20
> > dm-18              0.00     0.00    0.00  1811.20     0.00     7.07     8.00  1008.23   725.34   0.55 100.00
> > dm-18              0.00     0.00    0.00  2632.60     0.00    10.28     8.00  1651.18   640.61   0.38 100.00
> >
>
> Good.
> The high await is normal for writes, especially since you get so many
> queued requests.
> can you post the output of "grep -r . /sys/block/_device_/queue/" and
> iostat for your real devices?
> This should not affect reads, that will preempt writes with cfq.

/sys/block/sdc/queue/nr_requests:128
/sys/block/sdc/queue/read_ahead_kb:128
/sys/block/sdc/queue/max_hw_sectors_kb:2048
/sys/block/sdc/queue/max_sectors_kb:512
/sys/block/sdc/queue/scheduler:noop anticipatory deadline [cfq]
/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/physical_block_size:512
/sys/block/sdc/queue/minimum_io_size:512
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/rotational:1
/sys/block/sdc/queue/nomerges:0
/sys/block/sdc/queue/rq_affinity:0
/sys/block/sdc/queue/iostats:1
/sys/block/sdc/queue/iosched/quantum:4
/sys/block/sdc/queue/iosched/fifo_expire_sync:124
/sys/block/sdc/queue/iosched/fifo_expire_async:248
/sys/block/sdc/queue/iosched/back_seek_max:16384
/sys/block/sdc/queue/iosched/back_seek_penalty:2
/sys/block/sdc/queue/iosched/slice_sync:100
/sys/block/sdc/queue/iosched/slice_async:40
/sys/block/sdc/queue/iosched/slice_async_rq:2
/sys/block/sdc/queue/iosched/slice_idle:8

but as I said in my original post, I have done extensive tests,
twiddling all the knobs I know of, and the one thing that never
changed was that as soon as writers come into play, the readers
get starved pretty thoroughly.
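
For reference, the kind of twiddling I mean is along these lines
(the values below are purely illustrative, not a recommendation,
and sdc stands for the device under test):

  DEV=sdc                                                        # device under test
  echo 64       > /sys/block/$DEV/queue/nr_requests              # smaller request queue
  echo 20       > /sys/block/$DEV/queue/iosched/slice_async      # shorter write slices
  echo 1        > /sys/block/$DEV/queue/iosched/slice_async_rq   # fewer async requests per slice
  echo 200      > /sys/block/$DEV/queue/iosched/slice_sync       # longer read slices
  echo deadline > /sys/block/$DEV/queue/scheduler                # or switch the elevator entirely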

> >
> > in the read/write test the write rate stays down, but the rMB/s is even worse and the await is also way up,
> > so I guess the bad performance is not to blame on the cache in the controller ...
> >
> > Device:          rrqm/s   wrqm/s     r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  svctm  %util
> > ---------------------------------------------------------------------------------------------------------
> > dm-18              0.00     0.00    0.00  1225.80     0.00     4.79     8.00  1050.49   807.38   0.82 100.00
> > dm-18              0.00     0.00    0.00  1721.80     0.00     6.73     8.00  1823.67   807.20   0.58 100.00
> > dm-18              0.00     0.00    0.00  1128.00     0.00     4.41     8.00   617.94   832.52   0.89 100.00
> > dm-18              0.00     0.00    0.00   838.80     0.00     3.28     8.00   873.04  1056.37   1.19 100.00
> > dm-18              0.00     0.00   39.60   347.80     0.15     1.36     8.00   590.27  1880.05   2.57  99.68
> > dm-18              0.00     0.00    0.00  1626.00     0.00     6.35     8.00   983.85   452.72   0.62 100.00
> > dm-18              0.00     0.00    0.00  1117.00     0.00     4.36     8.00  1047.16  1117.78   0.90 100.00
> > dm-18              0.00     0.00    0.00  1319.20     0.00     5.15     8.00   840.64   573.39   0.76 100.00
> >
>
> This is interesting. it seems no reads are happening at all here.
> I suspect that this and your observation that the real read throughput
> is much higher, can be explained because your readers mostly read from
> the page cache.


> Can you describe better how the test workload is structured,
> especially regarding cache?
> How do you say that some tars are just readers, and some are just
> writers? I suppose you pre-fault-in the input for writers, and you
> direct output to /dev/null for readers? Are all readers reading from
> different directories?

they are NOT ... what I do is this (http://tobi.oetiker.ch/fspunisher).

* get a copy of the linux kernel source and place it on tmpfs

* unpack 4 copies of the linux kernel for each reader process

* sync and echo 3 >/proc/sys/vm/drop_caches (this should drop all
  cached data)

* start the readers, each on its private copy of the kernel source
  as prepared above, writing their output to /dev/null.

* start an equal number of tars unpacking the kernel archive from
  tmpfs into separate directories, next to the readers' source
  directories.

the goal of this exercise is to simulate independent writers and
readers while excluding the cache as much as possible, since I want
to see how the system deals with accessing the actual disk device
and not the local cache. a simplified sketch of the procedure
follows below.
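
In shell terms it boils down to roughly the following (a simplified
sketch of what fspunisher does, not the actual script; the tarball
path, the /test target directory and the count of 12 pairs are just
placeholders):

  #!/bin/sh
  # simplified sketch of the test procedure, not the real fspunisher script
  N=12                                  # reader/writer pairs (placeholder)
  SRC=/dev/shm/linux-2.6.31.tar.bz2     # kernel tarball staged on tmpfs (placeholder)
  WORK=/test                            # directory on the volume under test (placeholder)

  # prepare four private source trees per reader
  for i in $(seq 1 $N); do
      for j in 1 2 3 4; do
          mkdir -p $WORK/read-$i/$j
          tar -xjf $SRC -C $WORK/read-$i/$j
      done
  done

  # flush the page cache so the readers really have to hit the disk
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # readers: tar up their private trees, throwing the data away
  # (-f - rather than -f /dev/null, since GNU tar skips reading the
  #  file contents when the archive is literally /dev/null)
  for i in $(seq 1 $N); do
      tar -cf - -C $WORK/read-$i . > /dev/null &
  done

  # writers: unpack the tarball from tmpfs next to the readers' trees
  for i in $(seq 1 $N); do
      mkdir -p $WORK/write-$i
      tar -xjf $SRC -C $WORK/write-$i &
  done
  wait
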
>
> Few things to check:
> * are the cpus saturated during the test?
nope

> * are the readers mostly in state 'D', or 'S', or 'R'?

they are almost always in D (same as the writers), sleeping for IO

> * did you try 'as' I/O scheduler?

yes, this helps the readers a bit, but it does not improve the
'smoothness' of the system's operation

> * how big are your volumes?

100 GB

> * what is the average load on the system?

24 (or however many tars I start, since they all want to run all
the time and are all waiting for IO)

> * with i/o controller patches, what happens if you put readers in one
> domain and writers in the other?

will try ... note though that this would have to happen
automatically, since I cannot know beforehand who writes and who
reads ...
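
Assuming the io controller patches end up exposing a blkio-style
cgroup interface, the manual version of that test would look
roughly like this (I have not run the patches yet, so the mount
point, file names and weights below are guesses, and $READER_PIDS /
$WRITER_PIDS stand for the pids of the test tars):

  # rough sketch only; blkio.weight / tasks names are assumed, not verified
  mkdir -p /cgroup/blkio
  mount -t cgroup -o blkio none /cgroup/blkio
  mkdir /cgroup/blkio/readers /cgroup/blkio/writers

  echo 900 > /cgroup/blkio/readers/blkio.weight   # favour the readers
  echo 100 > /cgroup/blkio/writers/blkio.weight

  # classify the test processes by hand; the real problem is that in
  # production this classification would have to happen automatically
  for pid in $READER_PIDS; do echo $pid > /cgroup/blkio/readers/tasks; done
  for pid in $WRITER_PIDS; do echo $pid > /cgroup/blkio/writers/tasks; done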

> Are you willing to test some patches? I'm working on patches to reduce
> read latency, that may be interesting to you.

by all means ... to repeat my goal: I want to see the readers not
starved by the writers. I also have a hunch that the handling of
metadata updates/access in the filesystem may play a role here.
In the dd tests metadata does not come into play, but in real life
I find that the problem is often getting at the file, not reading
it once I have it ... this is especially tricky to test, since
metadata is cached so well that the cache can quickly distort the
results if not handled carefully.
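
A crude way to separate the metadata cost from the data cost would
be something like the following, dropping caches between runs so
neither pass is served from memory (the /test/read-1 path refers to
the sketch above and is only a placeholder):

  # pass 1: lookups and stats only, no file data read
  sync; echo 3 > /proc/sys/vm/drop_caches
  time find /test/read-1 -type f -print > /dev/null

  # pass 2: lookups plus the actual data
  sync; echo 3 > /proc/sys/vm/drop_caches
  time find /test/read-1 -type f -exec cat {} + > /dev/null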

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@xxxxxxxxxx ++41 62 775 9902 / sb: -9900