Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

From: Valentin Schneider
Date: Tue Jan 30 2018 - 06:41:42 EST


(Resending because I snuck in some HTML... Apologies)

On 01/30/2018 08:32 AM, Vincent Guittot wrote:
On 29 January 2018 at 20:31, Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
Hi Vincent, Peter,

I've been running some tests on your patches (Peter's base + the 2 from
Vincent). The results themselves are hosted at [1].
The base of those tests is the same: a task ("accumulator") is run for 5
seconds (arbitrary value) to accumulate some load, then goes to sleep for .5
seconds.

I've set up 3 test scenarios:

Update by nohz_balancer_kick()
------------------------------
Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
utilization) is spawned on another CPU. It won't go idle so the only way to
update the blocked load generated by "accumulator" is to kick an ILB
(NOHZ_STATS_KICK).

The test shows that this is behaving nicely - we keep kicking an ILB every
~36ms (see next test for comments on that) until there is no more blocked
load. I did however notice some interesting scenarios: after the load has
been fully decayed, a tiny background task can spawn and end in less than a
scheduling period. However, it still goes through nohz_balance_enter_idle(),
and thus sets nohz.stats_state, which will later cause an ILB kick.

This makes me wonder if it's worth kicking ILBs for such tiny load values -
perhaps it could be worth having a margin to set rq->has_blocked_load?
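
Something like the sketch below is what I have in mind - completely untested,
and the threshold name/value are made up for illustration, not something from
the patches:

/*
 * Hypothetical margin: only flag the rq as carrying blocked load worth an
 * ILB kick if the leftover averages are above some small threshold.
 */
#define BLOCKED_LOAD_MARGIN	16	/* out of SCHED_CAPACITY_SCALE (1024) */

static inline bool rq_blocked_load_above_margin(struct rq *rq)
{
	struct cfs_rq *cfs_rq = &rq->cfs;

	return cfs_rq->avg.load_avg > BLOCKED_LOAD_MARGIN ||
	       cfs_rq->avg.util_avg > BLOCKED_LOAD_MARGIN;
}

/* ...and nohz_balance_enter_idle() would then do something like: */
/*	rq->has_blocked_load = rq_blocked_load_above_margin(rq); */

Take it as a strawman - it's exactly the kind of thing that goes wrong in the
short-wakeup migration case Vincent points out below.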

So it's difficult to know what the load/utilization on the cfs_rq will be
once the cpu wakes up. Even if it runs for a really short time, that doesn't
mean that the load/utilization is small, because it can be the migration of
a big task that just happens to have a very short wakeup this time.
That's why I don't make any assumption on the utilization/load value
when a cpu goes to sleep.


Right, hadn't thought about those kinds of migrations.


Furthermore, this tiny task will cause the ILB to iterate over all of the
idle CPUs, although only one has stale load. For load update via NEWLY_IDLE
load_balance() we use:

static bool update_nohz_stats(struct rq *rq)
{
if (!rq->has_blocked_load)
return false;
[...]
}

But for load updates via _nohz_idle_balance(), we iterate through all of the
nohz CPUs and unconditionally call update_blocked_averages(). This could be
avoided by remembering which CPUs have stale load before going idle.
Initially I thought that was what nohz.stats_state was for, but it isn't.
With Vincent's patches it's only ever set to either 0 or 1, but we could use
it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
_nohz_idle_balance() (when NOHZ_STATS_KICK).
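
Roughly, I'm picturing something like this (untested; stale_load_mask and the
helper names are made up for illustration):

/*
 * Hypothetical alternative to the 0/1 nohz.stats_state: a mask of the idle
 * CPUs that still carry blocked load, so _nohz_idle_balance() only visits
 * those.
 */
static struct cpumask stale_load_mask;

/* would be called from nohz_balance_enter_idle() */
static inline void mark_stale_load(int cpu)
{
	cpumask_set_cpu(cpu, &stale_load_mask);
}

/* would replace the plain for_each_cpu() walk in _nohz_idle_balance() */
static void update_stale_blocked_load(void)
{
	int cpu;

	for_each_cpu_and(cpu, nohz.idle_cpus_mask, &stale_load_mask) {
		update_blocked_averages(cpu);
		if (!cpu_rq(cpu)->has_blocked_load)
			cpumask_clear_cpu(cpu, &stale_load_mask);
	}
}

The cpumask_{set,clear}_cpu() calls are atomic bitops, which is the extra
cost Vincent mentions below.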

I have studied a way to keep track of how many cpus still have blocked
load, to try to minimize the number of useless ilb kicks, but this adds
more atomic operations which can impact the system throughput under heavy
load with a lot of very small wakeups. That's why I have proposed this
solution, which is simpler. But it's probably just a matter of where we
want to "waste" time: either we accept to spend a bit more time checking
the state of idle CPUs, or we accept to kick an ilb from time to time for
no good reason.


Agreed. I have the feeling that spending more time doing atomic ops could be worth it - I'll try to test this out and see if it's actually relevant.


Update by idle_balance()
------------------------
Right before the "accumulator" task goes to sleep, a tiny periodic
(period=32ms) task is spawned on another CPU. It's expected that it will
update the blocked load in idle_balance(), either by running
_nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't
be set in this test case, so we shouldn't go through the NEWLY_IDLE
load_balance()).

This also seems to be working fine, but I'm noticing a delay between load
updates that is closer to 64ms than 32ms. After digging into it I found out
that the time checks done in idle_balance() and nohz_balancer_kick() are
time_after(jiffies, next_stats), but IMHO they should be
time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also
explains the 36ms periodicity of the updates in the test above.

I use 32ms as a minimum interval between updates. We must use
time_after() if we want to have at least 32ms between each update. We
will have a 36ms period if the previous update was triggered by the
tick (just after it, in fact) but there will be only 32ms if the last
update was done during an idle_balance that happens just before the
tick. With time_after_eq, the update period would be between 28 and
32ms.

Then, I mentioned a possible optimization: using time_after_eq in
idle_balance() so a newly idle cpu will have more chance (between 0
and 4ms for HZ=250) to do the update before an ilb is kicked.


IIUC with time_after() the update period should be within ]32, 36] ms, but it looks like I'm always on that upper bound in my tests.

When evaluating whether we need to kick_ilb() for load updates, we'll always be right after the tick (excluding the case in idle_balance), which explains why we wait for an extra tick in the "update by nohz_balancer_kick()" test case.

The tricky part is that, as you say, the update by idle_balance() can happen anywhere between [0-4[ ms after a tick (or before, depending on how you see it), so using time_after_eq could make the update period < 32ms - and this also impacts a load update by nohz_balancer_kick() if the previous update was done by idle_balance()... This is what causes the update period to be closer to 64ms in my test case, but it's somewhat artificial because I only have a 32ms-periodic task running - if there was any other task running, the period could remain in that ]32, 36] ms interval.

Did I get that right?
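
For reference, a quick userspace approximation of the jiffies arithmetic
(HZ=250 so 4ms per jiffy; the helpers mimic time_after()/time_after_eq()
minus the wraparound handling):

#include <stdio.h>

#define HZ		250
#define MS_PER_TICK	(1000 / HZ)		/* 4 ms per jiffy */
#define STATS_PERIOD	(32 / MS_PER_TICK)	/* 32 ms -> 8 jiffies */

static int after(unsigned long a, unsigned long b)    { return (long)(b - a) < 0; }
static int after_eq(unsigned long a, unsigned long b) { return (long)(a - b) >= 0; }

int main(void)
{
	/* last stats update happened at jiffies == 0 */
	unsigned long next_stats = STATS_PERIOD;
	unsigned long jiffies;

	for (jiffies = 7; jiffies <= 10; jiffies++)
		printf("%lums: time_after=%d time_after_eq=%d\n",
		       jiffies * MS_PER_TICK,
		       after(jiffies, next_stats),
		       after_eq(jiffies, next_stats));
	return 0;
}

This prints time_after=1 for the first time at 36ms and time_after_eq=1 at
32ms, which matches the ]32, 36] window above when the tick is the only
thing driving the update.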

Thanks,
Vincent



No update (idle system)
-----------------------
Nothing special here, just making sure nothing happens when the system is
fully idle. On a sidenote, that's relatively hard to achieve - I had to
switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno
still gets woken up every now and then but it's a bit quieter.


[1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0



On 01/24/2018 08:25 AM, Vincent Guittot wrote:

Hi,

On Thursday 18 Jan 2018 at 10:38:07 (+0000), Morten Rasmussen wrote:

On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:

On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:

Hi Peter,

On 22 December 2017 at 21:42, Peter Zijlstra <peterz@xxxxxxxxxxxxx>
wrote:

On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:

Right; but I figured we'd try and do it 'right' and see how horrible it is
before we try and do funny things.

So now it should have a 32ms tick for up to .5s when the system goes
completely idle.

No idea how bad that is..

I have tested your branch but the timer doesn't seem to fire correctly,
because I can still see blocked load in the use case I have run.
I haven't found the reason yet.

Hi Peter,

With the patch below on top of your branch, the blocked loads are updated
and decayed regularly. The main differences are:
- It doesn't use a timer to trigger the ilb but relies on the tick and on a
cpu becoming idle. The main drawback of this solution is that the load stays
blocked when the system is fully idle, with the advantage of not waking up a
fully idle system. We have to wait for the next tick or a newly idle event
to update the blocked load when the system leaves the idle state, which can
take up to a tick. If this is too long, we can check whether to kick an ilb
when a task wakes up, so the blocked load will be updated as soon as the
system leaves the idle state. The main advantage is that we don't wake up a
fully idle system every 32ms to update blocked load that will not be used.
- I'm working on one more improvement: use nohz_idle_balance in the newly
idle case when the system is not overloaded and
(this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try
to use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
exceeds this_rq->avg_idle. This will remove some calls to kick_ilb and some
wake ups of idle cpus.

This sounds like what I meant in my other reply :-)

It seems pointless to have a timer to update PELT if the system is
completely idle, and when it isn't we can piggy back other events to
make the updates happen.

The patch below implements what has been described above. It calls part of
nohz_idle_balance when a cpu becomes idle and kicks an ilb if that takes too
much time. This removes part of the ilbs that are kicked on an idle cpu to
update the blocked load, but the ratio really depends on when the tick
happens compared to a cpu becoming idle and the 32ms boundary. I have an
additional patch that updates the blocked loads when a cpu becomes idle 1
period before kicking an ilb, and there are far fewer ilbs because we give
more chance to the newly idle case (time_after is replaced by time_after_eq
in idle_balance()).

The patch also uses a function cfs_rq_has_blocked, which only checks the
util/load_avg, instead of cfs_rq_is_decayed, which checks the *_sum too.
This significantly reduces the number of updates of blocked load. The *_avg
will be fully decayed in around 300~400ms, but it takes far longer for the
*_sum, which have a higher resolution, and we can easily reach almost
seconds. But only the *_avg are used to make decisions, so keeping some
blocked *_sum is acceptable.
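
As a back-of-the-envelope check of the 300~400ms figure: PELT halves a
blocked contribution every 32ms, so a util_avg of 1024 (always-running task)
needs about 10 half-lives to reach 0. A quick userspace approximation, not
kernel code:

#include <stdio.h>

int main(void)
{
	/* PELT halves the blocked (non-running) signal every 32ms */
	double util = 1024.0;
	int ms;

	for (ms = 0; util >= 1.0; ms += 32)
		util *= 0.5;

	printf("util_avg is fully decayed after ~%d ms\n", ms);	/* ~350 ms */
	return 0;
}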

---
kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
return true;
}
+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
+{
+ if (cfs_rq->avg.load_avg)
+ return true;
+
+ if (cfs_rq->avg.util_avg)
+ return true;
+
+ return false;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static void update_blocked_averages(int cpu)
@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
*/
if (cfs_rq_is_decayed(cfs_rq))
list_del_leaf_cfs_rq(cfs_rq);
- else
+
+ /* Don't need periodic decay once load/util_avg are null */
+ if (cfs_rq_has_blocked(cfs_rq))
done = false;
}
@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
- if (cfs_rq_is_decayed(cfs_rq))
+ if (cfs_rq_has_blocked(cfs_rq))
rq->has_blocked_load = 0;
#endif
rq_unlock_irqrestore(rq, &rf);
@@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
*next_balance = next;
}
+static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle);
static void kick_ilb(unsigned int flags);
/*
@@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
update_next_balance(sd, &next_balance);
rcu_read_unlock();
- if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
+ /*
+ * Update blocked idle load if it has not been done for a
+ * while. Try to do it locally before entering idle but kick a
+ * ilb if it takes too much time and might delay next local
+ * wake up
+ */
+ if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
+     !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
kick_ilb(NOHZ_STATS_KICK);
goto out;
@@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
return;
+ rq->has_blocked_load = 1;
if (rq->nohz_tick_stopped)
return;
@@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
return;
rq->nohz_tick_stopped = 1;
- rq->has_blocked_load = 1;
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
atomic_inc(&nohz.nr_cpus);
@@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
* enable the periodic update of the load of idle cpus
*/
atomic_set(&nohz.stats_state, 1);
-
}
#else
static inline void nohz_balancer_kick(struct rq *rq) { }
@@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
#ifdef CONFIG_NO_HZ_COMMON
/*
- * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
- * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ * Internal function that runs load balance for all idle cpus. The load balance
+ * can be a simple update of blocked load or a complete load balance with
+ * tasks movement depending on flags.
+ * For newly idle mode, we abort the loop if it takes too much time and return
+ * false to notify that the loop has not been completed and an ilb should be
+ * kicked.
*/
-static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle)
{
/* Earliest time when we have to do rebalance again */
unsigned long now = jiffies;
@@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
bool has_blocked_load = false;
int update_next_balance = 0;
int this_cpu = this_rq->cpu;
- unsigned int flags;
int balance_cpu;
+ int ret = false;
struct rq *rq;
-
- if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
- return false;
-
- if (idle != CPU_IDLE) {
- atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
- return false;
- }
-
- /*
- * barrier, pairs with nohz_balance_enter_idle(), ensures ...
- */
- flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
- if (!(flags & NOHZ_KICK_MASK))
- return false;
+ u64 curr_cost = 0;
SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
@@ -9428,6 +9437,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
atomic_set(&nohz.stats_state, 0);
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+ u64 t0, domain_cost;
+
+ t0 = sched_clock_cpu(this_cpu);
+
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;
@@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
*/
if (need_resched()) {
has_blocked_load = true;
- break;
+ goto abort;
+ }
+
+ /*
+ * If the update is done while CPU becomes idle, we abort
+ * the update when its cost is higher than the average idle
+ * time in order to not delay a possible wake up.
+ */
+ if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
+ has_blocked_load = true;
+ goto abort;
}
rq = cpu_rq(balance_cpu);
@@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (time_after_eq(jiffies, rq->next_balance)) {
struct rq_flags rf;
- rq_lock_irq(rq, &rf);
+ rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
cpu_load_update_idle(rq);
- rq_unlock_irq(rq, &rf);
+ rq_unlock_irqrestore(rq, &rf);
if (flags & NOHZ_BALANCE_KICK)
rebalance_domains(rq, CPU_IDLE);
@@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
next_balance = rq->next_balance;
update_next_balance = 1;
}
+
+ domain_cost = sched_clock_cpu(this_cpu) - t0;
+ curr_cost += domain_cost;
+
}
- update_blocked_averages(this_cpu);
- has_blocked_load |= this_rq->has_blocked_load;
+ /* Newly idle CPU doesn't need an update */
+ if (idle != CPU_NEWLY_IDLE) {
+ update_blocked_averages(this_cpu);
+ has_blocked_load |= this_rq->has_blocked_load;
+ }
if (flags & NOHZ_BALANCE_KICK)
rebalance_domains(this_rq, CPU_IDLE);
@@ -9477,6 +9507,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
WRITE_ONCE(nohz.next_stats, now + msecs_to_jiffies(LOAD_AVG_PERIOD));
+ /* The full idle balance loop has been done */
+ ret = true;
+
+abort:
/* There is still blocked load, enable periodic update */
if (has_blocked_load)
atomic_set(&nohz.stats_state, 1);
@@ -9489,6 +9523,35 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (likely(update_next_balance))
nohz.next_balance = next_balance;
+ return ret;
+}
+
+/*
+ * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
+ * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ */
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+{
+ int this_cpu = this_rq->cpu;
+ unsigned int flags;
+
+ if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+ return false;
+
+ if (idle != CPU_IDLE) {
+ atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+ return false;
+ }
+
+ /*
+ * barrier, pairs with nohz_balance_enter_idle(), ensures ...
+ */
+ flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+ if (!(flags & NOHZ_KICK_MASK))
+ return false;
+
+ _nohz_idle_balance(this_rq, flags, idle);
+
return true;
}
#else