Re: [patch 2/2] sched: dynticks idle load balancing - v2

From: Andrew Morton
Date: Wed Feb 21 2007 - 15:24:14 EST


On Fri, 16 Feb 2007 18:08:42 -0800
"Siddha, Suresh B" <suresh.b.siddha@xxxxxxxxx> wrote:

> Changes since v1:
>
> - Move the idle load balancer selection from schedule()
> to the first busy scheduler_tick() after restarting the tick.
> This will avoid the unnecessay ownership changes when
> softirq's(which are run in softirqd context in certain -rt
> configurations) like timer, sched are invoked for every idle tick
> that happens.
>
> - Misc cleanups.
>
> ---
> Fix the process idle load balancing in the presence of dynticks.
> cpus for which ticks are stopped will sleep till the next event wakes it up.
> Potentially these sleeps can be for large durations and during which today,
> there is no periodic idle load balancing being done.
>
> This patch nominates an owner among the idle cpus, which does the idle load
> balancing on behalf of the other idle cpus. And once all the cpus are
> completely idle, then we can stop this idle load balancing too. Checks added
> in fast path are minimized. Whenever there are busy cpus in the system, there
> will be an owner(idle cpu) doing the system wide idle load balancing.
>
> Open items:
> 1. Intelligent owner selection (like an idle core in a busy package).
> 2. Merge with rcu's nohz_cpu_mask?
>

I spose I'll hold my nose and merge this, but it creates too much of a mess
to be mergeable into the CPU scheduler, IMO.

Can we please do something to reduce the ifdef density? And if possible,
all the newly-added returns-from-the-middle-of-a-function?


> +#ifdef CONFIG_NO_HZ
> +static struct {
> + int load_balancer;
> + cpumask_t cpu_mask;
> +} nohz ____cacheline_aligned = {
> + .load_balancer = -1,
> + .cpu_mask = CPU_MASK_NONE,
> +};
> +
> +/*
> + * This routine will try to nominate the ilb (idle load balancing)
> + * owner among the cpus whose ticks are stopped. ilb owner will do the idle
> + * load balancing on behalf of all those cpus. If all the cpus in the system
> + * go into this tickless mode, then there will be no ilb owner (as there is
> + * no need for one) and all the cpus will sleep till the next wakeup event
> + * arrives...
> + *
> + * For the ilb owner, tick is not stopped. And this tick will be used
> + * for idle load balancing. ilb owner will still be part of
> + * nohz.cpu_mask..
> + *
> + * While stopping the tick, this cpu will become the ilb owner if there
> + * is no other owner. And will be the owner till that cpu becomes busy
> + * or if all cpus in the system stop their ticks at which point
> + * there is no need for ilb owner.
> + *
> + * When the ilb owner becomes busy, it nominates another owner, during the
> + * next busy scheduler_tick()
> + */
> +int select_nohz_load_balancer(int stop_tick)
> +{
> + int cpu = smp_processor_id();
> +
> + if (stop_tick) {
> + cpu_set(cpu, nohz.cpu_mask);
> + cpu_rq(cpu)->in_nohz_recently = 1;
> +
> + /*
> + * If we are going offline and still the leader, give up!
> + */
> + if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
> + if (cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)

So we require that architectures which implement CONFIG_NO_HZ also
implement cmpxchg.

> + BUG();
> + return 0;
> + }
> +
> + /* time for ilb owner also to sleep */
> + if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
> + if (nohz.load_balancer == cpu)
> + nohz.load_balancer = -1;
> + return 0;
> + }
> +
> + if (nohz.load_balancer == -1) {
> + /* make me the ilb owner */
> + if (cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
> + return 1;
> + } else if (nohz.load_balancer == cpu)
> + return 1;
> + } else {
> + if (!cpu_isset(cpu, nohz.cpu_mask))
> + return 0;
> +
> + cpu_clear(cpu, nohz.cpu_mask);
> +
> + if (nohz.load_balancer == cpu)
> + if (cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
> + BUG();
> + }
> + return 0;
> +}
> +#endif
> +
> /*
> * run_rebalance_domains is triggered when needed from the scheduler tick.
> *
> @@ -3347,15 +3437,46 @@ static DEFINE_SPINLOCK(balancing);
>
> static void run_rebalance_domains(struct softirq_action *h)
> {
> - int this_cpu = smp_processor_id(), balance = 1;
> - struct rq *this_rq = cpu_rq(this_cpu);
> - unsigned long interval;
> + int balance_cpu = smp_processor_id(), balance;
> + struct rq *balance_rq = cpu_rq(balance_cpu);
> + unsigned long interval, next_balance;

One definition per line is preferred.

> struct sched_domain *sd;
> - enum idle_type idle = this_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
> + enum idle_type idle;
> +
> +#ifdef CONFIG_NO_HZ
> + cpumask_t cpus = nohz.cpu_mask;
> + int local_cpu = balance_cpu;
> + struct rq *local_rq = balance_rq;
> +
> +restart:
> +
> + if (local_rq->idle_at_tick && nohz.load_balancer == local_cpu) {
> + if (need_resched())
> + return;
> +
> + balance_cpu = first_cpu(cpus);
> + if (unlikely(balance_cpu >= NR_CPUS))
> + return;
> + balance_rq = cpu_rq(balance_cpu);
> + cpu_clear(balance_cpu, cpus);
> + }
> +
> + /*
> + * We must be doing idle load balancing for some other idle cpu
> + */
> + if (local_cpu != balance_cpu)
> + idle = SCHED_IDLE;
> + else
> +#endif
> + idle = balance_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
> +

It's not just that this is an eyesore (it is) - it introduces some ongoing
maintenance problems. I expect this can be pulled out into a separate
function with no loss in quality of generated code.

And that function would be a nice site for a nice comment explaining the
design. Why do we return if need_resched()? What is the significance of
(balance_cpu >= NR_CPUS)?


> +
> /* Earliest time when we have to call run_rebalance_domains again */
> - unsigned long next_balance = jiffies + 60*HZ;
> + next_balance = jiffies + 60*HZ;
> +
> + for_each_domain(balance_cpu, sd) {
>
> - for_each_domain(this_cpu, sd) {
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
>
> @@ -3374,7 +3495,7 @@ static void run_rebalance_domains(struct
> }
>
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> - if (load_balance(this_cpu, this_rq, sd, idle, &balance)) {
> + if (load_balance(balance_cpu, balance_rq, sd, idle, &balance)) {
> /*
> * We've pulled tasks over so either we're no
> * longer idle, or one of our SMT siblings is
> @@ -3398,7 +3519,16 @@ out:
> if (!balance)
> break;
> }
> - this_rq->next_balance = next_balance;
> + balance_rq->next_balance = next_balance;
> +
> +#ifdef CONFIG_NO_HZ
> + if (local_rq->idle_at_tick && nohz.load_balancer == local_cpu) {
> + if (time_after(next_balance, local_rq->next_balance))
> + local_rq->next_balance = next_balance;
> + if (!cpus_empty(cpus))
> + goto restart;
> + }
> +#endif
> }

A goto in an ifdef :(

Perhaps it can be removed by teaching the caller to call this function again?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/