Re: [question] sched: idle_avg and migration latency

From: Daniel Lezcano
Date: Tue Dec 10 2013 - 13:31:11 EST


On 12/10/2013 04:11 PM, Mike Galbraith wrote:
On Tue, 2013-12-10 at 12:30 +0100, Daniel Lezcano wrote:
Hi All,

I am trying to understand how avg_idle is computed and how it is
used with respect to the migration latency.

1. What is the sysctl_sched_migration_cost value? It is initialized to
500000UL. Is it an arbitrarily chosen value? Could it change depending
on the hardware performance?

Yeah, it's a magic number. We used to use boot time measurements.

2. The idle_balance function checks:

	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		return 0;

IIUC, it is not worth migrating a task to this cpu, as we expect to run
another task before we could pull one over to the current cpu, right?

No, that's all about not beating living hell outta ourselves on every
micro-idle. As with all load balancing, it's usually too much balancing
that creates a problem. You need it, but it's really expensive, so less
is more.

Then if there is no task to balance we will enter idle, thus we
initialize the idle_stamp to the current clock.

When another task is woken up via ttwu_do_wakeup(), the duration of
the idle time is computed in there:

	if (rq->idle_stamp) {
		u64 delta = rq_clock(rq) - rq->idle_stamp;
		u64 max = 2*sysctl_sched_migration_cost;

		if (delta > max)
			rq->avg_idle = max;
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}

Why is 'delta' capped at 'max'?

That has changed a little recently. I originally slammed avg_idle
itself straight to max to ensure that a bursty load would idle balance,
and not use stale data. If you start cross core switching at high
frequency, you'll still shut idle balancing off quickly.

Ok, thanks for the explanation.

I think I am a bit puzzled by the 'avg_idle' name. I am guessing the
semantics of this variable is "how long this cpu has been idle".

With no_hz, the idle duration could be long, several seconds, if the
work queues have been migrated and the timer affinity is set to another
cpu. So if we fall into this case, and there is then a burst of activity
with micro-idle periods, and avg_idle is not capped at max, it will stay
high for some time, and we will pull tasks at each micro-idle period,
right?
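
To make the scenario above concrete, here is a minimal userspace sketch,
not kernel code. account_idle() and micro_idles_until_gate_closes() are
hypothetical helpers I made up for illustration, the 'clamp' flag only
exists to compare the two behaviours, and the 5 s idle period and 10 us
micro-idle values are assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define MIGRATION_COST_NS 500000ULL	/* default sysctl_sched_migration_cost */

/* Same arithmetic as update_avg(): move avg 1/8 of the way toward sample. */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	/* Relies on an arithmetic right shift for negative diff (gcc/clang). */
	*avg += (uint64_t)(diff >> 3);
}

/* Hypothetical model of the wakeup-side accounting quoted earlier. */
static void account_idle(uint64_t *avg_idle, uint64_t delta_ns, bool clamp)
{
	uint64_t max = 2 * MIGRATION_COST_NS;

	if (clamp && delta_ns > max)
		*avg_idle = max;	/* long idle: slam straight to max */
	else
		update_avg(avg_idle, delta_ns);
}

/*
 * Count the micro-idle periods needed before avg_idle drops below the
 * migration cost, i.e. before the idle_balance() check would bail out.
 * Returns -1 if that does not happen within 1000 periods.
 */
static int micro_idles_until_gate_closes(uint64_t avg_idle,
					 uint64_t micro_idle_ns, bool clamp)
{
	int n;

	for (n = 0; n < 1000; n++) {
		if (avg_idle < MIGRATION_COST_NS)
			return n;
		account_idle(&avg_idle, micro_idle_ns, clamp);
	}
	return -1;
}
```

Starting from avg_idle = 0, one 5 s idle period leaves avg_idle at
1000000 with the cap and at 625000000 without it. With 10 us micro-idle
periods after that, the capped value falls below 500000 after 6
wakeups, while the uncapped one needs several dozen.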

3. And finally the function update_avg does:

	s64 diff = sample - *avg;
	*avg += diff >> 3;

Why is diff >> 3 used instead of dividing by the number of samples?

Ingo's quick like bunny smooth average.

Yeah, an average computed on the fly. But why 'divide by 8'? (Cc'ed Ingo.)
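
For reference, the shift makes this an exponentially weighted moving
average: avg += (sample - avg) / 8 is the same as
avg = 7/8 * avg + 1/8 * sample, so no sample count or history has to be
kept. The sketch below is a userspace illustration only. ewma_update()
and its 'shift' parameter are hypothetical, since the quoted code
hard-codes 3, and it says nothing about why 3 was picked:

```c
#include <stdint.h>

/*
 * Hypothetical generalisation of update_avg(): each new sample gets a
 * weight of 1/2^shift and the previous average keeps the rest.
 */
static void ewma_update(uint64_t *avg, uint64_t sample, unsigned int shift)
{
	int64_t diff = (int64_t)(sample - *avg);

	/* Relies on an arithmetic right shift for negative diff (gcc/clang). */
	*avg += (uint64_t)(diff >> shift);
}
```

Starting from 0 and feeding the sample 800 twice gives 100 and then 187
with shift 3, versus 400 and then 600 with shift 1. A larger shift
smooths more but reacts more slowly: with shift 3 it takes about five
samples to cover half of a step change, since (7/8)^5 is roughly 0.51.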

Thanks for taking the time to answer.

-- Daniel

--
<http://www.linaro.org/> Linaro.org | Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
