Re: [Fwd: [RFC] cfq: adapt slice to number of processes doing I/O (v2.1)]

From: Corrado Zoccolo
Date: Fri Sep 25 2009 - 04:06:20 EST


Hi Shan,
On Thu, Sep 24, 2009 at 11:21 AM, Shan Wei <shanwei@xxxxxxxxxxxxxx> wrote:
>> Subject:
>> [RFC] cfq: adapt slice to number of processes doing I/O (v2.1)
>>
>> When the number of processes performing I/O concurrently increases,
>> a fixed time slice per process will cause large latencies.
>>
>> This (v2.1) patch will scale the time slice assigned to each process,
>> according to a target latency (tunable from sysfs, default 300ms).
>>
>> In order to keep fairness among processes, we introduce two mechanisms compared to v1:
>>
>> * The number of active processes is computed using a special form of
>> running average that quickly follows sudden increases (to keep latency low)
>> and decreases slowly (to preserve fairness in spite of rapid drops in this
>> value).
>>
>> * The idle time is computed using remaining slice as a maximum.
>>
>> To safeguard sequential bandwidth, we impose a minimum time slice
>> (computed using 2*cfq_slice_idle as base, adjusted according to priority
>> and async-ness).
>>
>> Signed-off-by: Corrado Zoccolo <czoccolo@xxxxxxxxx>
>>
>
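(A quick note on the "special form of running average" mentioned in the
changelog above: with the default hist_divisor of 4, the average moves
about 3/4 of the way toward any increase in the number of busy queues,
but only 1/4 of the way toward a decrease. A minimal user-space sketch
of the update rule used in cfq_get_interested_queues() further down,
just to illustrate the behaviour:

/* Toy user-space model of the busy-queue average from the patch below
 * (cfq_get_interested_queues), with the default hist_divisor = 4.
 * It follows increases quickly and decays slowly. */
#include <stdio.h>

static unsigned hist_divisor = 4;

static unsigned update_avg(unsigned avg, unsigned busy)
{
	unsigned mult = hist_divisor - 1;
	unsigned round = hist_divisor / 2;
	unsigned min_q = busy < avg ? busy : avg;
	unsigned max_q = busy > avg ? busy : avg;

	return (mult * max_q + min_q + round) / hist_divisor;
}

int main(void)
{
	unsigned busy[] = { 1, 8, 8, 8, 1, 1, 1, 1, 1 };
	unsigned avg = 0;
	unsigned i;

	for (i = 0; i < sizeof(busy) / sizeof(busy[0]); i++) {
		avg = update_avg(avg, busy[i]);
		printf("busy=%u avg=%u\n", busy[i], avg);
	}
	return 0;
}

With that input sequence the average jumps 1 -> 6 -> 8 in two steps when
the load rises, but decays only slowly (8 -> 6 -> 5 -> 4 -> 3 -> ...) when
the load drops back to 1.)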
> I'm interested in the idea of dynamically tuning the time slice according to the number of
> processes. I have tested your patch using Jeff's tool on kernel 2.6.30-rc4.
> From the following test results, after applying your patch the fairness is not as good
> as with the original kernel, e.g. io priority 4 vs 5 in the be0-through-7.fio case.

In Jeff's test, the results are always in the same order, so you can
compare different runs, e.g. be4-x-8.fio with be0-through-7.fio.
If we compare the processes that caused the priority inversion, i.e.
the 5th and 6th processes, in the equal-priority case we have:

for original cfq:
(5th) be 4 64548 42388 -35
(6th) be 4 64548 73780 14

for patched cfq:
(5th) be 4 57384 55028 -5
(6th) be 4 57384 69620 21

It is clear that the 5th file is slower to access (by between 30% and
50%) than the 6th (maybe due to fragmentation, or bad disk placement),
so this easily explains the 11% priority inversion seen in
be0-through-7.fio:
(5th) be 4 56116 18420 -68
(6th) be 5 44893 19700 -57

BTW, with smaller slices even a single seek during an otherwise
sequential read causes a relatively larger degradation, so the effect
is more visible with the patched cfq.

From your numbers, actually, I see much better fairness than what Jeff
experienced with the first version of my patch, so the improvements
have paid off.

> And the throughput (total data transferred) becomes lower.
That is expected: we trade some throughput for better latency. You can
increase target_latency to get the throughput back (e.g. for
non-interactive servers). Note that decreasing target_latency too much
will not improve latency indefinitely, since we have a lower bound on
the slice (and hardware limitations).
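To give an idea of the numbers involved, here is a rough user-space
sketch of the scaling described in the changelog (not the patch code
itself, which also adjusts for priority and sync/async), assuming
HZ = 1000 and the default tunables: 100ms base sync slice, 8ms
slice_idle, 300ms target_latency:

/* Rough model of the slice scaling described above (HZ = 1000).
 * Priority and sync/async adjustments are left out for simplicity. */
#include <stdio.h>

#define SLICE_SYNC	100	/* cfq_slice[1], ms */
#define SLICE_IDLE	8	/* cfq_slice_idle, ms */
#define TARGET_LATENCY	300	/* cfq_target_latency, ms */

static unsigned scaled_slice(unsigned iq)
{
	unsigned slice = SLICE_SYNC;
	unsigned process_thr = TARGET_LATENCY / SLICE_SYNC;		/* 3 */
	unsigned low_slice = 2 * slice * SLICE_IDLE / SLICE_SYNC;	/* 16 ms */

	if (iq > process_thr) {
		/* shrink so that iq slices fit the target latency,
		 * but never go below the low_slice floor */
		unsigned shrunk = slice * TARGET_LATENCY / (SLICE_SYNC * iq);
		slice = shrunk > low_slice ? shrunk : low_slice;
	}
	return slice;
}

int main(void)
{
	unsigned iq;

	for (iq = 1; iq <= 32; iq *= 2)
		printf("busy queues %2u -> slice %3u ms\n", iq, scaled_slice(iq));
	return 0;
}

With these defaults nothing changes up to 3 busy queues; at 8 queues the
slice drops to ~37ms, and from roughly 19 queues onward it stays clamped
at the 16ms floor, which is where latency stops improving.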

> Have you tested buffered writes and multiple threads?
Yes. One interesting test was done by Tobias Oetiker
(http://oss.oetiker.ch/optools/wiki/fsopbench) on his Areca hardware
RAID6 with ext3. He set target_latency to 200ms and used a lower
slice_idle to match his faster disks.

Here the writers create many files of varying sizes (the distribution
is realistic, simulating a user's home directory), write to them
(causing buffered writes for big files, plus metadata & journal
writes), and then sync them (sync writes to both data and metadata),
so this is a torture test for ext3.
There is just one reader, which tries to access files and measures the
latency of doing so.

On 2.6.31 with stock cfq and ext3 in ordered data mode it gives the following results

./fsopbench.pl --writers=6 --continuous /tmp/mnt/home_a/fsopbench_tree/

In-Competition with 6 Writers - Mode 30s Interval
-----------------------------------------------------------------
A read dir     cnt 371  min 0.001 ms  max  348.297 ms  mean  2.499 ms  stdev  22.800
B stat file    cnt 351  min 0.007 ms  max 1130.490 ms  mean  4.881 ms  stdev  61.887
C open file    cnt 293  min 0.014 ms  max    0.195 ms  mean  0.022 ms  stdev   0.017
D rd 1st byte  cnt 293  min 0.178 ms  max 1117.346 ms  mean 57.902 ms  stdev 153.807
E read rate    1.123 MB/s

On 2.6.31 with my cfq patch

In-Competition with 6 Writers - Mode 31s Interval
-----------------------------------------------------------------
A read dir     cnt 388  min 0.001 ms  max  240.531 ms  mean  1.145 ms  stdev  12.878
B stat file    cnt 366  min 0.006 ms  max  199.893 ms  mean  0.920 ms  stdev  10.840
C open file    cnt 300  min 0.014 ms  max    0.340 ms  mean  0.025 ms  stdev   0.028
D rd 1st byte  cnt 300  min 0.184 ms  max 1443.066 ms  mean 69.599 ms  stdev 193.984
E read rate    1.023 MB/s

As you can see, my patch improved both the average and maximum latency
for readdir and stat, while reading the first byte took more time and
the read throughput decreased by about 10%.
The total worst-case time to reach a file and read data from it
decreased from about 2.5s to about 1.8s, and the average case also
gained a few ms.

>
> Additionally I have a question about the minimum time slice; see my comment inside the quoted patch below.
>
> *Original* (2.6.30-rc4 without patch):
> /cfq-regression-tests/2.6.30-rc4-log/be0-through-7.fio
> total priority: 880
> total data transferred: 535872
> class  prio  ideal   xferred %diff
> be     0     109610  149748  36
> be     1     97431   104436  7
> be     2     85252   91124   6
> be     3     73073   64244   -13
> be     4     60894   59028   -4
> be     5     48715   38132   -22
> be     6     36536   21492   -42
> be     7     24357   7668    -69
>
> /cfq-regression-tests/2.6.30-rc4-log/be0-vs-be1.fio
> total priority: 340
> total data transferred: 556008
> class  prio  ideal   xferred %diff
> be     0     294357  402164  36
> be     1     261650  153844  -42
>
> /cfq-regression-tests/2.6.30-rc4-log/be0-vs-be7.fio
> total priority: 220
> total data transferred: 537064
> class  prio  ideal   xferred %diff
> be     0     439416  466164  6
> be     7     97648   70900   -28
>
> /cfq-regression-tests/2.6.30-rc4-log/be4-x-3.fio
> total priority: 300
> total data transferred: 532964
> class  prio  ideal   xferred %diff
> be     4     177654  199260  12
> be     4     177654  149748  -16
> be     4     177654  183956  3
>
> /cfq-regression-tests/2.6.30-rc4-log/be4-x-8.fio
> total priority: 800
> total data transferred: 516384
> class  prio  ideal   xferred %diff
> be     4     64548   78580   21
> be     4     64548   76436   18
> be     4     64548   75764   17
> be     4     64548   70900   9
> be     4     64548   42388   -35
> be     4     64548   73780   14
> be     4     64548   30708   -53
> be     4     64548   67828   5
>
>
> *Applied patch* (2.6.30-rc4 with patch):
>
> /cfq-regression-tests/log-result/be0-through-7.fio
> total priority: 880
> total data transferred: 493824
> class  prio  ideal   xferred %diff
> be     0     101009  224852  122
> be     1     89786   106996  19
> be     2     78562   70388   -11
> be     3     67339   38900   -43
> be     4     56116   18420   -68
> be     5     44893   19700   -57
> be     6     33669   9972    -71
> be     7     22446   4596    -80
>
> /cfq-regression-tests/log-result/be0-vs-be1.fio
> total priority: 340
> total data transferred: 537064
> class  prio  ideal   xferred %diff
> be     0     284328  375540  32
> be     1     252736  161524  -37
>
> /cfq-regression-tests/log-result/be0-vs-be7.fio
> total priority: 220
> total data transferred: 551912
> class  prio  ideal   xferred %diff
> be     0     451564  499956  10
> be     7     100347  51956   -49
>
> /cfq-regression-tests/log-result/be4-x-3.fio
> total priority: 300
> total data transferred: 509404
> class  prio  ideal   xferred %diff
> be     4     169801  196596  15
> be     4     169801  198388  16
> be     4     169801  114420  -33
>
> /cfq-regression-tests/log-result/be4-x-8.fio
> total priority: 800
> total data transferred: 459072
> class  prio  ideal   xferred %diff
> be     4     57384   70644   23
> be     4     57384   52980   -8
> be     4     57384   62356   8
> be     4     57384   60660   5
> be     4     57384   55028   -5
> be     4     57384   69620   21
> be     4     57384   51956   -10
> be     4     57384   35828   -38
>
>
> Hardware info:
> CPU: GenuineIntel Intel(R) Xeon(TM) CPU 3.00GHz
>      (4 logical CPUs with hyper-threading on)
> memory: 2G
> HDD: SCSI
>
>> ---
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 0e3814b..ca90d42 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -27,6 +27,8 @@ static const int cfq_slice_sync = HZ / 10;
>>  static int cfq_slice_async = HZ / 25;
>>  static const int cfq_slice_async_rq = 2;
>>  static int cfq_slice_idle = HZ / 125;
>> +static int cfq_target_latency = HZ * 3/10; /* 300 ms */
>> +static int cfq_hist_divisor = 4;
>>
>>  /*
>>   * offset from end of service tree
>> @@ -134,6 +136,9 @@ struct cfq_data {
>>       struct rb_root prio_trees[CFQ_PRIO_LISTS];
>>
>>       unsigned int busy_queues;
>> +     unsigned int busy_queues_avg;
>> +     unsigned int busy_rt_queues;
>> +     unsigned int busy_rt_queues_avg;
>>
>>       int rq_in_driver[2];
>>       int sync_flight;
>> @@ -173,6 +178,8 @@ struct cfq_data {
>>       unsigned int cfq_slice[2];
>>       unsigned int cfq_slice_async_rq;
>>       unsigned int cfq_slice_idle;
>> +     unsigned int cfq_target_latency;
>> +     unsigned int cfq_hist_divisor;
>>
>>       struct list_head cic_list;
>>
>> @@ -301,10 +308,40 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>       return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
>>  }
>>
>> +static inline unsigned
>> +cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
>> +     unsigned min_q, max_q;
>> +     unsigned mult = cfqd->cfq_hist_divisor - 1;
>> +     unsigned round = cfqd->cfq_hist_divisor / 2;
>> +     if (rt) {
>> +             min_q = min(cfqd->busy_rt_queues_avg, cfqd->busy_rt_queues);
>> +             max_q = max(cfqd->busy_rt_queues_avg, cfqd->busy_rt_queues);
>> +             cfqd->busy_rt_queues_avg = (mult * max_q + min_q + round) /
>> +                     cfqd->cfq_hist_divisor;
>> +             return cfqd->busy_rt_queues_avg;
>> +     } else {
>> +             min_q = min(cfqd->busy_queues_avg, cfqd->busy_queues);
>> +             max_q = max(cfqd->busy_queues_avg, cfqd->busy_queues);
>> +             cfqd->busy_queues_avg = (mult * max_q + min_q + round) /
>> +                     cfqd->cfq_hist_divisor;
>> +             return cfqd->busy_queues_avg;
>> +     }
>> +}
>> +
>>  static inline void
>>  cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>  {
>> -     cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
>> +     unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
>> +     unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
>> +     unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
>> +
>> +     if (iq > process_thr) {
>> +             unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
>> +                     / cfqd->cfq_slice[1];
>
> For a sync queue, the minimum time slice is decided by slice_idle, the base time slice and the io priority.
> But for an async queue, why is the minimum time slice also limited by the base time slice of the sync queue?

You should consider that slice, computed by calling cfq_prio_to_slice,
will already be:
for sync:  cfqd->cfq_slice[1], adjusted for the priority
for async: cfqd->cfq_slice[0], adjusted for the priority
Now, I take that number, multiply it by twice the slice_idle, and
divide it by cfqd->cfq_slice[1].
So the minimum slice becomes (approximately):
for sync:  2 * slice_idle, adjusted for the priority
for async: 2 * slice_idle * cfqd->cfq_slice[0] / cfqd->cfq_slice[1],
           adjusted for the priority
Now, if we ignore priorities for a while, we can see that the ratio
between the sync and async slices in the original cfq is:
cfqd->cfq_slice[1] / cfqd->cfq_slice[0]
and the ratio between the minimum slices in my patch is:
2 * slice_idle / (2 * slice_idle * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]) =
cfqd->cfq_slice[1] / cfqd->cfq_slice[0]
i.e. exactly the same.
So the intent is to preserve the ratio between sync and async slices;
that is why the async minimum is also divided by the sync base slice.
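To put numbers on it, with the default tunables quoted at the top of
the patch (at HZ = 1000: 100 ms sync slice, 40 ms async slice, 8 ms
slice_idle), the sync minimum is 2 * 100 * 8 / 100 = 16 ms and the
async minimum is 2 * 40 * 8 / 100 = 6.4 ms, and 16 / 6.4 = 2.5 =
100 / 40. A trivial check (in floating point just to show the exact
ratio; the kernel does the same arithmetic in integer jiffies, so
there is some rounding):

/* Sanity check that the minimum-slice formula keeps the sync:async
 * ratio of the default tunables (HZ = 1000). */
#include <stdio.h>

int main(void)
{
	double slice_sync = 100, slice_async = 40, slice_idle = 8;
	double min_sync  = 2 * slice_sync  * slice_idle / slice_sync;	/* 16.0 ms */
	double min_async = 2 * slice_async * slice_idle / slice_sync;	/*  6.4 ms */

	printf("min_sync/min_async = %.2f, slice_sync/slice_async = %.2f\n",
	       min_sync / min_async, slice_sync / slice_async);	/* both 2.50 */
	return 0;
}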

Corrado

>
>
> Best Regards
> -----
> Shan Wei
>
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo@xxxxxxxxx
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------