Re: [RFC PATCH] sched/eevdf: Use tunable knob sysctl_sched_base_slice as explicit time quanta

From: Vishal Chourasia
Date: Tue Jan 23 2024 - 07:42:58 EST


On Thu, Jan 11, 2024 at 06:57:46AM -0500, Ze Gao wrote:
> AFAIS, We've overlooked what role of the concept of time quanta plays
> in EEVDF. According to Theorem 1 in [1], we have
>
> -r_max < log_k(t) < max(r_max, q)
>
> cleary we don't want either r_max (the maximum user request) or q (time
> quanta) to be too much big.
>
> To trade for throughput, in [2] it chooses to do tick preemtion at
> per request boundary (i.e., once a cetain request is fulfilled), which
> means we literally have no concept of time quanta defined anymore.
> Obviously this is no problem if we make
>
> q = r_i = sysctl_sched_base_slice
>
> just as exactly what we have for now, which actually creates a implict
> quanta for us and works well.
>
> However, with custom slice being possible, the lag bound is subject
> only to the distribution of users requested slices given the fact no
> time quantum is available now and we would pay the cost of losing
> many scheduling opportunities to maintain fairness and responsiveness
> due to [2]. What's worse, we may suffer unexpected unfairness and
> lantecy.
>
> For example, take two cpu bound processes with the same weight and bind
> them to the same cpu, and let process A request for 100ms whereas B
> request for 0.1ms each time (with HZ=1000, sysctl_sched_base_slice=3ms,
> nr_cpu=42). And we can clearly see that playing with custom slice can
> actually incur unfair cpu bandwidth allocation (10706 whose request
> length is 0.1ms gets more cpu time as well as better latency compared to
> 10705. Note you might see the other way around in different machines but
> the allocation inaccuracy retains, and even top can show you the
> noticeble difference in terms of cpu util by per second reporting), which
> is obviously not what we want because that would mess up the nice system
> and fairness would not hold.

Hi, How are you setting custom request values for process A and B?

>
> stress-ng-cpu:10705 stress-ng-cpu:10706
> ---------------------------------------------------------------------
> Slices(ms) 100 0.1
> Runtime(ms) 4934.206 5025.048
> Switches 58 67
> Average delay(ms) 87.074 73.863
> Maximum delay(ms) 101.998 101.010
>
> In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> in this patch gives us a better control of the allocation accuracy and
> the avg latency:
>
> stress-ng-cpu:10584 stress-ng-cpu:10583
> ---------------------------------------------------------------------
> Slices(ms) 100 0.1
> Runtime(ms) 4980.309 4981.356
> Switches 1253 1254
> Average delay(ms) 3.990 3.990
> Maximum delay(ms) 5.001 4.014
>
> Furthmore, with sysctl_sched_base_slice = 10ms, we might benefit from
> less switches at the cost of worse delay:
>
> stress-ng-cpu:11208 stress-ng-cpu:11207
> ---------------------------------------------------------------------
> Slices(ms) 100 0.1
> Runtime(ms) 4983.722 4977.035
> Switches 456 456
> Average delay(ms) 10.963 10.939
> Maximum delay(ms) 19.002 21.001
>
> By being able to tune sysctl_sched_base_slice knob, we can achieve
> the goal to strike a good balance between throughput and latency by
> adjusting the frequency of context switches, and the conclusions are
> much close to what's covered in [1] with the explicit definition of
> a time quantum. And it aslo gives more freedom to choose the eligible
> request length range(either through nice value or raw value)
> without worrying about overscheduling or underscheduling too much.
>
> Note this change should introduce no obvious regression because all
> processes have the same request length as sysctl_sched_base_slice as
> in the status quo. And the result of benchmarks proves this as well.
>
> schbench -m2 -F128 -n10 -r90 w/patch tip/6.7-rc7
> Wakeup (usec): 99.0th: 3028 95
> Request (usec): 99.0th: 14992 21984
> RPS (count): 50.0th: 5864 5848
>
> hackbench -s 512 -l 200 -f 25 -P w/patch tip/6.7-rc7
> -g 10 0.212 0.223
> -g 20 0.415 0.432
> -g 30 0.625 0.639
> -g 40 0.852 0.858
>
> [1]: https://dl.acm.org/doi/10.5555/890606
> [2]: https://lore.kernel.org/all/20230420150537.GC4253@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#u
>
> Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
> ---