Re: CFS Bandwidth Control - Test results of cgroups tasks pinned vs unpinned

From: Srivatsa Vaddagiri
Date: Thu Sep 08 2011 - 11:15:46 EST


* Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> [2011-09-07 21:22:22]:

> On Wed, 2011-09-07 at 20:50 +0530, Srivatsa Vaddagiri wrote:
> >
> > Fix excessive idle time reported when cgroups are capped.
>
> Where from? The whole idea of bandwidth caps is to introduce idle time,
> so what's excessive and where does it come from?

We have set up cgroups and their hard limits such that, in theory, they
should consume the entire capacity available on the machine, leading to
0% idle time. That's not what we see. A more detailed description of the
setup and the problem is here:

https://lkml.org/lkml/2011/6/7/352

but to quickly summarize, the machine and the test-case are as below:

Machine : 16-cpus (2 Quad-core w/ HT enabled)
Cgroups : 5 in number (C1-C5), with {2, 2, 4, 8, 16} tasks respectively.
          Further, each task is placed in its own (sub-)cgroup with
          a capped usage of 50% CPU.

/C1/C1_1/Task1 -> capped at 50% cpu usage
/C1/C1_2/Task2 -> capped at 50% cpu usage
/C2/C2_1/Task3 -> capped at 50% cpu usage
/C2/C2_2/Task4 -> capped at 50% cpu usage
/C3/C3_1/Task5 -> capped at 50% cpu usage
/C3/C3_2/Task6 -> capped at 50% cpu usage
/C3/C3_3/Task7 -> capped at 50% cpu usage
/C3/C3_4/Task8 -> capped at 50% cpu usage
...
/C5/C5_16/Task32 -> capped at 50% cpu usage

So we have 32 tasks, each capped at 50% CPU usage, running on a 16-CPU
system. One would expect 0% idle time in this scenario, which was found
not to be the case. With early versions of CFS hard limits, up to ~20%
idle time was seen, while with the current version in tip we see up to
~10% idle time (when cfs.period = 100ms), which goes down to ~5% when
cfs.period is set to 500ms.
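
For concreteness, below is a rough sketch of how such a hierarchy can be
set up through the cpu controller's cpu.cfs_period_us/cpu.cfs_quota_us
files. The mount point and directory layout are assumptions here; this
is only an illustration, not the actual test-setup script.

/* Sketch: cap each of the 32 per-task sub-groups at 50% of a CPU,
 * i.e. quota = period / 2.  Assumes the cpu controller is mounted
 * at /sys/fs/cgroup/cpu and that this runs as root.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_val(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return;
        }
        fputs(val, f);
        fclose(f);
}

static void set_cap(const char *dir)
{
        char path[256];

        mkdir(dir, 0755);

        snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", dir);
        write_val(path, "100000");      /* 100ms period */

        snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", dir);
        write_val(path, "50000");       /* 50ms quota -> 50% of one cpu */
}

int main(void)
{
        int ntasks[] = { 2, 2, 4, 8, 16 };      /* tasks in C1..C5 */
        char dir[256];
        int g, t;

        for (g = 0; g < 5; g++) {
                snprintf(dir, sizeof(dir), "/sys/fs/cgroup/cpu/C%d", g + 1);
                mkdir(dir, 0755);

                for (t = 0; t < ntasks[g]; t++) {
                        snprintf(dir, sizeof(dir),
                                 "/sys/fs/cgroup/cpu/C%d/C%d_%d",
                                 g + 1, g + 1, t + 1);
                        set_cap(dir);
                }
        }
        return 0;
}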

From what I could find out, the "excess" idle time crops up because the
load balancer is not perfect. For example, there are instances when a
CPU has just 1 task on its runqueue (rather than the ideal number of 2
tasks/cpu). When that lone task exceeds its 50% limit, the CPU is forced
to go idle.
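
To spell out the arithmetic (a toy model that ignores balancing and
period boundaries, just to show the numbers): with quota = 50ms out of
a 100ms period, a CPU left with a single capped task sits idle for half
of every period, while the ideal 2 tasks/cpu keeps it fully busy.

/* Toy model: each task may run quota_us out of every period_us, so a
 * CPU with n such tasks is busy min(n * quota, period) per period.
 */
#include <stdio.h>

int main(void)
{
        const double period_us = 100000.0, quota_us = 50000.0;
        int n;

        for (n = 1; n <= 3; n++) {
                double busy = n * quota_us;

                if (busy > period_us)
                        busy = period_us;
                printf("%d task(s) on a cpu -> %2.0f%% idle per period\n",
                       n, 100.0 * (period_us - busy) / period_us);
        }
        return 0;
}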

> > The patch introduces the notion of "steal"
>
> The virt folks already claimed steal-time and have it mean something
> entirely different. You get to pick a new name.

grace time?

> > (or "grace") time which is the surplus
> > time/bandwidth each cgroup is allowed to consume, subject to a maximum
> > steal time (sched_cfs_max_steal_time_us). Cgroups are allowed this "steal"
> > or "grace" time when the lone task running on a cpu is about to be throttled.
>
> Ok, so this is a solution to an unstated problem. Why is it a good
> solution?

I am not sure if there are any "good" solutions to this problem! One
possibility is to make the idle load balancer more aggressive in pulling
tasks across sched-domain boundaries, i.e., when a CPU becomes idle
(after a task got throttled) and invokes the idle load balancer, it
should try "harder" to pull a task from far-off cpus (across
package/node boundaries)?
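
For comparison, in the same toy model as above, the effect of the
steal/grace time is roughly the following. Only the tunable name
sched_cfs_max_steal_time_us comes from the patch; the numbers and the
model itself are made up for illustration.

/* Toy illustration of the grace-time idea: a lone task about to be
 * throttled may overrun its quota by up to a grace allowance instead
 * of idling the cpu.  The 25ms grace value is just an example.
 */
#include <stdio.h>

#define PERIOD_US       100000.0        /* cfs.period = 100ms */
#define QUOTA_US         50000.0        /* 50% cap */

static double busy_us(int ntasks, double grace_us)
{
        double busy = ntasks * QUOTA_US;

        if (ntasks == 1)        /* grace applies only to a lone, throttled task */
                busy += grace_us;

        return busy > PERIOD_US ? PERIOD_US : busy;
}

int main(void)
{
        /* e.g. sched_cfs_max_steal_time_us = 25000 */
        double grace_us = 25000.0;

        printf("lone task, no grace  : %.0f%% idle\n",
               100.0 * (PERIOD_US - busy_us(1, 0.0)) / PERIOD_US);
        printf("lone task, 25ms grace: %.0f%% idle\n",
               100.0 * (PERIOD_US - busy_us(1, grace_us)) / PERIOD_US);
        return 0;
}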

> Also, another tunable, yay!

- vatsa