Re: [PATCH] sched: wake up task on prev_cpu if not in SD_WAKE_AFFINE domain with cpu

From: Rik van Riel
Date: Tue May 13 2014 - 10:09:07 EST


On 05/09/2014 11:54 PM, Mike Galbraith wrote:
> On Fri, 2014-05-09 at 14:16 -0400, Rik van Riel wrote:
>
>> That leaves the big question: do we want to fall back to
>> prev_cpu if it is not idle, and it has an idle sibling,
>> or would it be better to find an idle sibling of prev_cpu
>> when we wake up a task?
>
> Yes. If there was A correct answer, this stuff would be a lot easier.

OK, after doing some other NUMA stuff, and then looking at the scheduler
again with a fresh mind, I have drawn some more conclusions about what
the scheduler does, and how it breaks NUMA locality :)

1) If the node_distance between nodes on a NUMA system is
<= RECLAIM_DISTANCE, we will call select_idle_sibling for
a wakeup of a previously existing task (SD_BALANCE_WAKE)

2) If the node distance exceeds RECLAIM_DISTANCE, we will
wake up a task on prev_cpu, even if it is not currently
idle

This second behaviour only happens on certain large NUMA systems,
and is different from the behaviour on smaller systems.
I suspect we will want to call select_idle_sibling with
prev_cpu when the target and prev_cpu are not in the same
SD_WAKE_AFFINE domain.
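
The split between (1) and (2) comes from the NUMA domain setup, which
strips SD_WAKE_AFFINE (along with the fork/exec balancing flags) from
domains whose node distance exceeds RECLAIM_DISTANCE. Roughly, going
from memory of sd_numa_init(), so the details may be slightly off:

	if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
		sd->flags &= ~(SD_BALANCE_EXEC |
			       SD_BALANCE_FORK |
			       SD_WAKE_AFFINE);
	}

Without a common SD_WAKE_AFFINE domain, no affine_sd is found for the
wakeup and select_task_rq_fair() falls back to prev_cpu without ever
calling select_idle_sibling().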

3) If wake_wide is false, we call select_idle_sibling with
the CPU number of the task that is doing the wakeup (the waker's CPU)

4) If wake_wide is true, we call select_idle_sibling with
the CPU number the task was previously running on (prev_cpu)
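
Both (3) and (4) fall out of the affine branch in
select_task_rq_fair(). Roughly, paraphrased from memory, with the
load checks inside wake_affine() elided:

	if (affine_sd) {
		/*
		 * cpu is the waking CPU, prev_cpu is where the wakee
		 * last ran. wake_affine() returns 0 when wake_wide(p)
		 * is true; otherwise it compares waker and wakee load
		 * and, if it succeeds, lets us redirect prev_cpu to
		 * the waking CPU.
		 */
		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
			prev_cpu = cpu;

		new_cpu = select_idle_sibling(p, prev_cpu);
		goto unlock;
	}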

In effect, the "wake the task near the waking task's CPU" behaviour
is the default, regardless of how frequently a task wakes up
its wakee, and regardless of the impact on NUMA locality.

This may need to be changed.

5) select_idle_sibling will place the task on the CPU from (3) or (4)
only if that CPU is actually idle. If task A communicates with task
B through a pipe or a socket, and does a sync wakeup, task
B will never be placed on task A's CPU (which is not idle yet), and it
will only be placed on its own previous CPU if that CPU is currently
idle.
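
That is just the top of select_idle_sibling(), which takes the CPU
from (3) or (4) as "target". Paraphrased:

	static int select_idle_sibling(struct task_struct *p, int target)
	{
		int i = task_cpu(p);	/* the wakee's prev_cpu */

		if (idle_cpu(target))
			return target;

		/* prev_cpu, if it shares cache with target and is idle */
		if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
			return i;

		/* ... the group walk described in (6) follows here ... */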

6) If neither CPU is idle, select_idle_sibling will walk all the
CPUs in the SD_SHARE_PKG_RESOURCES SD of the target. This looks
correct to me, though it could result in more work by the load
balancing code later on, since it does not take load into account
at all. It is unclear if this needs any changes.
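
For reference, this is the tail of select_idle_sibling(); it only
ever asks idle_cpu(), never looks at load, and only picks a group
that does not contain target and in which every CPU is idle.
Paraphrased, again from memory:

	sd = rcu_dereference(per_cpu(sd_llc, target));
	for_each_lower_domain(sd) {
		sg = sd->groups;
		do {
			if (!cpumask_intersects(sched_group_cpus(sg),
						tsk_cpus_allowed(p)))
				goto next;

			/* skip groups containing target or any busy CPU */
			for_each_cpu(i, sched_group_cpus(sg)) {
				if (i == target || !idle_cpu(i))
					goto next;
			}

			target = cpumask_first_and(sched_group_cpus(sg),
						   tsk_cpus_allowed(p));
			goto done;
	next:
			sg = sg->next;
		} while (sg != sd->groups);
	}
	done:
		return target;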

Am I overlooking anything?

What benchmarks should I run to test any changes I make?

Are there particular system types people want me to run tests with?

--
All rights reversed