Re: [PATCH v3] sched/fair: limit sched slice duration

From: Dietmar Eggemann
Date: Thu Oct 06 2022 - 07:35:10 EST


On 03/10/2022 14:21, Vincent Guittot wrote:
> In the presence of a lot of small-weight tasks like sched_idle tasks, normal
> or high-weight tasks can see their ideal runtime (sched_slice) increase
> to hundreds of ms, whereas it normally stays below sysctl_sched_latency.
>
> 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> (half of the sched_period). This means that they will make progress
> every sysctl_sched_latency period.
>
> If we now add 1000 idle tasks on the CPU, the sched_period becomes
> 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> It will even become 1500ms if the idle tasks belong to an idle cgroup.
> This means that the scheduler will only look to pick another waiting task
> after 609ms of running time (1500ms respectively). The idle tasks
> significantly change the way the 2 normal tasks interleave their running
> time slots, whereas they should have only a small impact.
>
> Such a long sched_slice can significantly delay the release of resources,
> as the tasks can wait hundreds of ms before their next running slot just
> because of idle tasks queued on the rq.
>
> Cap the ideal_runtime to sysctl_sched_latency to make sure that tasks will
> regularly make progress and will not be significantly impacted by
> idle/background tasks queued on the rq.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
>
> Change since v2:
> - Cap ideal_runtime from the beginning as suggested by Peter
>
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5ffec4370602..c309d57efb2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4584,7 +4584,13 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> struct sched_entity *se;
> s64 delta;
>
> - ideal_runtime = sched_slice(cfs_rq, curr);
> + /*
> + * When many tasks blow up the sched_period; it is possible that
> + * sched_slice() reports unusually large results (when many tasks are
> + * very light for example). Therefore impose a maximum.
> + */
> + ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> +
> delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> if (delta_exec > ideal_runtime) {
> resched_curr(rq_of(cfs_rq));

Tested on a 6-CPU system (sysctl_sched_latency=18ms,
sysctl_sched_min_granularity=2.25ms)

I start to see `slice > period` when I run:

(a) > ~50 idle tasks in '/' for an arbitrary nice=0 task

(b) > ~50 nice=0 tasks in '/A' w/ cpu.shares = max for se of '/A'

Essentially, this happens whenever cfs_rq->nr_running > sched_nr_latency
and se_weight is high relative to cfs_rq_weight.
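
For reference, the figures in the changelog can be reproduced with a
small userspace model. This is only a sketch, not the kernel code: the
model_*() helpers are made-up names, the hierarchy is flattened to a
single level, and the constants are assumptions inferred from the
changelog (sysctl_sched_latency=24ms, sysctl_sched_min_granularity=3ms,
sched_nr_latency=8, weight 1024 for a nice=0 task and 3 for an idle
task or idle group entity).

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Assumed tunables, inferred from the changelog; not read from a live system. */
static const uint64_t latency_ns = 24 * NSEC_PER_MSEC;
static const uint64_t min_gran_ns = 3 * NSEC_PER_MSEC;
static const unsigned int nr_latency = 8;

/* Period stretches linearly once more than nr_latency tasks are runnable. */
static uint64_t model_period(unsigned int nr_running)
{
	if (nr_running > nr_latency)
		return nr_running * min_gran_ns;
	return latency_ns;
}

/* Slice is the entity's weight-proportional share of the period. */
static uint64_t model_slice(unsigned int nr_running, uint64_t se_weight,
			    uint64_t rq_weight)
{
	return model_period(nr_running) * se_weight / rq_weight;
}

/* Same share, capped at the latency target as the patch does. */
static uint64_t model_capped_slice(unsigned int nr_running, uint64_t se_weight,
				   uint64_t rq_weight)
{
	uint64_t slice = model_slice(nr_running, se_weight, rq_weight);

	return slice < latency_ns ? slice : latency_ns;
}
```

With these assumptions, model_slice(2, 1024, 2048) gives 12ms for the
two normal tasks alone. model_slice(1002, 1024, 2 * 1024 + 1000 * 3)
gives about 609ms with 1000 idle tasks on the rq, and
model_slice(1002, 1024, 2 * 1024 + 3) gives about 1500ms when the idle
tasks sit behind one idle group entity. model_capped_slice() returns
24ms in both of the latter cases.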

Tested-By: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>