[PATCH] sched: call select_idle_sibling when not affine_sd

From: Rik van Riel
Date: Wed May 14 2014 - 11:41:22 EST


On Wed, 14 May 2014 06:08:09 +0200
Mike Galbraith <umgwanakikbuti@xxxxxxxxx> wrote:
> On Tue, 2014-05-13 at 10:08 -0400, Rik van Riel wrote:

> > 1) If the node_distance between nodes on a NUMA system is
> > <= RECLAIM_DISTANCE, we will call select_idle_sibling for
> > a wakeup of a previously existing task (SD_BALANCE_WAKE)
> >
> > 2) If the node distance exceeds RECLAIM_DISTANCE, we will
> > wake up a task on prev_cpu, even if it is not currently
> > idle
> >
> > This behaviour only happens on certain large NUMA systems,
> > and is different from the behaviour on small systems.
> > I suspect we will want to call select_idle_sibling with
> > prev_cpu in case target and prev_cpu are not in the same
> > SD_WAKE_AFFINE domain.
>
> Sometimes. It's the same can of worms remote as it is local.. latency
> gain may or may not outweigh cache miss pain.

Ahh, but it is a DIFFERENT can of worms. If the distance between
cpu and prev_cpu exceeds RECLAIM_DISTANCE, we will not look for
an idle sibling in the same LLC domain as prev_cpu.

If the distance is smaller, and we decide not to do an affine
wakeup, then we do look for an idle sibling of prev_cpu.

This patch makes sure that both types of systems have the same
can of worms :)
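
The before/after behaviour described above can be sketched as a small
standalone C model. This is only an illustration, not kernel code: the
enum, the two function names, and the boolean parameters are
hypothetical stand-ins (have_affine_sd for "an SD_WAKE_AFFINE domain
spans cpu and prev_cpu", wake_affine_ok for wake_affine() returning
true), and it assumes the call is a wakeup of an existing task.

```c
#include <stdbool.h>

/* Hypothetical model of the wakeup decision; not kernel code. */
enum wake_path {
	IDLE_SIBLING_OF_CPU,      /* select_idle_sibling(p, cpu) */
	IDLE_SIBLING_OF_PREV_CPU, /* select_idle_sibling(p, prev_cpu) */
	NO_IDLE_SIBLING_SEARCH,   /* task stays on prev_cpu, idle or not */
};

/* Behaviour before the patch, as described in this thread. */
static enum wake_path old_path(int cpu, int prev_cpu,
			       bool have_affine_sd, bool wake_affine_ok)
{
	/* No affine domain (nodes too far apart): no idle search at all. */
	if (!have_affine_sd)
		return NO_IDLE_SIBLING_SEARCH;
	/* Affine wakeup accepted: search near the waker's CPU. */
	if (cpu != prev_cpu && wake_affine_ok)
		return IDLE_SIBLING_OF_CPU;
	return IDLE_SIBLING_OF_PREV_CPU;
}

/* Behaviour with the patch: the idle search always happens. */
static enum wake_path new_path(int cpu, int prev_cpu,
			       bool have_affine_sd, bool wake_affine_ok)
{
	if (have_affine_sd && cpu != prev_cpu && wake_affine_ok)
		return IDLE_SIBLING_OF_CPU;
	/* Not pulled across nodes, but still look for an idle sibling. */
	return IDLE_SIBLING_OF_PREV_CPU;
}
```

In this model the two functions differ only when have_affine_sd is
false: old_path() skips the idle search, while new_path() searches
around prev_cpu without pulling the task to the waker's node.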

---8<---

Subject: sched: call select_idle_sibling when not affine_sd

On smaller systems, the top level sched domain will be an affine
domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE
wakeup. This seems to be working well.

On larger systems, where the distance between far-away NUMA nodes
exceeds RECLAIM_DISTANCE, select_idle_sibling is only called if the
waker and the wakee are on nodes no more than RECLAIM_DISTANCE apart.

This patch leaves in place the policy of not pulling the task across
nodes on such systems, while fixing the issue that select_idle_sibling
is not called at all in certain circumstances.

The code will look for an idle CPU in the same CPU package as the
CPU where the task ran previously.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39b63d0..1e58159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4423,10 +4423,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		sd = tmp;
 	}

-	if (affine_sd) {
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
+	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		prev_cpu = cpu;

+	if (sd_flag & SD_WAKE_AFFINE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
