Re: [PATCH v5 1/3] sched/fair: Introduce the burstable CFS controller

From: Odin Ugedal
Date: Thu May 20 2021 - 10:05:12 EST


Hi,

Here are some more thoughts and questions:

> The benefit of burst is seen when testing with schbench:
>
> echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
> echo 600000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
> echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
> echo 400000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us
>
> # The average CPU usage is around 500%, which is 200ms CPU time
> # every 40ms.
> ./schbench -m 1 -t 30 -r 10 -c 10000 -R 500
>
> Without burst:
>
> Latency percentiles (usec)
> 50.0000th: 7
> 75.0000th: 8
> 90.0000th: 9
> 95.0000th: 10
> *99.0000th: 933
> 99.5000th: 981
> 99.9000th: 3068
> min=0, max=20054
> rps: 498.31 p95 (usec) 10 p99 (usec) 933 p95/cputime 0.10% p99/cputime 9.33%

It should be noted that this was run on a 64-core machine (if that is still
the case; ref. your previous patch).

I am curious how much you have tried tweaking both the period and the quota
for this workload. I assume a longer period can help such a bursty application,
and judging from the small slowdowns, a slightly higher quota could also help.
I am not saying this is a bad idea, but we need to understand what it fixes,
and how, in order to understand how (and whether) to use it.

Also, what value of the sysctl kernel.sched_cfs_bandwidth_slice_us are you
using? Which CONFIG_HZ you are using is also interesting, due to how bandwidth
is accounted for. There is some more info about it in
Documentation/scheduler/sched-bwc.rst. I assume a smaller slice value may also
help, and it would be interesting to see what implications it has. A high
ratio of threads to (quota/period), together with a high bandwidth slice, will
probably cause some throttling, so one has to choose between precision and
overhead.
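
To put rough numbers on the slice question, here is a back-of-the-envelope
sketch. It assumes the documented default slice of 5000us, takes the 30 worker
threads from the quoted schbench invocation, and assumes the worst case where
every thread sits on its own CPU holding a freshly acquired slice. It ignores
the return of unused slack runtime, so treat it as an upper bound, not as a
description of what the scheduler actually does at any given moment.

```shell
# Assumed values: 5000us is the documented default slice, 30 is the worker
# thread count from the quoted schbench run, quota is from the quoted example.
slice_us=5000
threads=30
quota_us=600000

# Number of slices the global pool can hand out per period.
slices_per_period=$(( quota_us / slice_us ))
# Runtime parked in per-CPU pools if each thread holds one full slice.
held_us=$(( threads * slice_us ))
# The same amount as a share of the quota.
held_pct=$(( held_us * 100 / quota_us ))

echo "slices=${slices_per_period} held=${held_us}us (${held_pct}% of quota)"
```

Under these assumptions up to a quarter of the quota can be spread across
per-CPU pools at once, which is why a smaller slice trades overhead for
precision.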

Also, here you give a burst of 66% of the quota. Would that be a typical value
for a cgroup, or is it just a result of testing? As I understand this patchset,
your example would allow 600% constant CPU load, then one period with 1000%
load, then another "long set" of periods with 600% load. Have you discussed a
way of limiting how long burst can be "saved" before expiring?
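
For reference, the percentages above follow from the quoted settings like
this. It is only a sketch of my reading of the patchset, namely that a fully
accumulated burst is added on top of the quota for a single period; the
variable names are just for illustration.

```shell
# Values from the quoted example, in microseconds.
quota_us=600000
period_us=100000
burst_us=400000

# CPU percentage that can be sustained in every period.
steady_pct=$(( quota_us * 100 / period_us ))
# Upper bound for one period, assuming the full burst has been accumulated.
peak_pct=$(( (quota_us + burst_us) * 100 / period_us ))
# Burst relative to quota, rounded down by integer division.
burst_ratio_pct=$(( burst_us * 100 / quota_us ))

echo "steady=${steady_pct}% peak=${peak_pct}% burst/quota=${burst_ratio_pct}%"
```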

> @@ -9427,7 +9478,8 @@ static int cpu_max_show(struct seq_file *sf, void *v)
> {
> struct task_group *tg = css_tg(seq_css(sf));
>
> - cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg));
> + cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg),
> + tg_get_cfs_burst(tg));
> return 0;
> }

The current cgroup v2 docs say the following:

> cpu.max
> A read-write two value file which exists on non-root cgroups.
> The default is "max 100000".

This will become a "three value file", and I know of a few user space projects
that parse this file by splitting on the space between the two values. I am
not sure if they are "wrong", but I don't think we usually break such things.
Not sure what Tejun thinks about this.
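
To illustrate the concern, here is a minimal sketch of such a parser.
parse_cpu_max is a hypothetical helper, not taken from any specific project,
and the three-value line assumes the new format would be "quota period burst",
which is my guess from the quoted hunk.

```shell
# Hypothetical parser: splits on the first space and treats everything
# after it as the period, which is fine for a two value file.
parse_cpu_max() {
    line=$1
    quota=${line%% *}    # text before the first space
    period=${line#* }    # everything after the first space
    printf '%s|%s\n' "$quota" "$period"
}

parse_cpu_max "max 100000"            # prints: max|100000
parse_cpu_max "600000 100000 400000"  # prints: 600000|100000 400000
```

With a third value appended, the "period" field silently becomes
"100000 400000", which will fail or misbehave once it is treated as a number.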

Thanks
Odin