Re: [PATCH 3/3] sched: Update ->next_balance correctly during newidle balance

From: Vincent Guittot
Date: Fri Oct 20 2023 - 09:40:30 EST


On Fri, 20 Oct 2023 at 03:40, Joel Fernandes (Google)
<joel@xxxxxxxxxxxxxxxxx> wrote:
>
> From: "Vineeth Pillai (Google)" <vineeth@xxxxxxxxxxxxxxx>
>
> When newidle balancing triggers, we see that it constantly clobbers
> rq->next_balance even when there is no newidle balance happening due to
> the cost estimates. Due to this, we see that periodic load balance
> (rebalance_domains) may trigger way more often when the CPU is going in
> and out of idle at a high rate but is no really idle. Repeatedly
> triggering load balance there is a bad idea as it is a heavy operation.
> It also causes increases in softirq.

we have 2 balance intervals:
- one when idle based on the sd->balance_interval = sd_weight
- one when busy which increases the period by multiplying it with
busy_factor = 16

When becoming idle, the rq->next_balance can have been set using the
16*sd_weight period so load_balance can wait for a long time before
running idle load balance for this cpu.
As a typical example, instead of waiting at most 8ms, we will wait
128ms before we try to pull a task on the idle CPU.

That's the reason for updating rq->next_balance in newidle_balance()

>
> Another issue is ->last_balance is not updated after newidle balance
> causing mistakes in the ->next_balance calculations.

newly idle load balance is not equal to idle load balance. It's a
light load balance trying to pull one task and you can't really
consider it to the normal load balance

>
> Fix by updating last_balance when a newidle load balance actually
> happens and then updating next_balance. This is also how it is done in
> other load balance paths.
>
> Testing shows a significant drop in softirqs when running:
> cyclictest -i 100 -d 100 --latency=1000 -D 5 -t -m -q
>
> Goes from ~6k to ~800.

Even if your figures look interesting, your patch adds regression in
the load balance and the fairness.

We can probably do improve the current behavior for decreasing number
of ILB but your proposal is not the right solution IMO

>
> Cc: Suleiman Souhlal <suleiman@xxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Frederic Weisbecker <frederic@xxxxxxxxxx>
> Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@xxxxxxxxxxxxxxx>
> Co-developed-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 8 ++------
> 1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8e276d12c3cb..b147ad09126a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12076,11 +12076,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (!READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> -
> - if (sd)
> - update_next_balance(sd, &next_balance);
> rcu_read_unlock();
> -
> goto out;
> }
> rcu_read_unlock();
> @@ -12095,8 +12091,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> int continue_balancing = 1;
> u64 domain_cost;
>
> - update_next_balance(sd, &next_balance);
> -
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
> break;
>
> @@ -12109,6 +12103,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> t1 = sched_clock_cpu(this_cpu);
> domain_cost = t1 - t0;
> update_newidle_cost(sd, domain_cost);
> + sd->last_balance = jiffies;
> + update_next_balance(sd, &next_balance);
>
> curr_cost += domain_cost;
> t0 = t1;
> --
> 2.42.0.655.g421f12c284-goog
>