Re: [RFC] scheduler issue & patch

From: Siddha, Suresh B
Date: Mon Jun 12 2006 - 13:33:17 EST


On Mon, Jun 12, 2006 at 05:30:42PM +0200, Gerd Hoffmann wrote:
> Hi,
>
> I'm looking into a scheduler issue with a NUMA box and scheduling
> domains. The machine is a dual-core opteron with two nodes, i.e.
> four cpus. cpu0+1 build node0, cpu2+3 build node1.
>
> Now I have an application (benchmark) with two threads which performs
> best when the two threads are running on different nodes (probably
> because the cpus on each node share the L2 cache). The scheduler tends
> to keep threads on the local node though, which probably makes sense
> in most cases because local memory is faster.
>
> Ok, we have tools to give hints to the scheduler (taskset, numactl).
> The problem is that it doesn't work well. I can ask the scheduler to use
> cpu1 (node0) and cpu3 (node1) only (via "taskset 0x0a"). But the
> scheduler very often schedules both threads on the same cpu :-(
>
> I think the reason is that the scheduler always checks the complete cpu
> groups when calculating the group load, without looking at
> task->cpus_allowed. So we have the effect that the scheduler walks down
> the scheduler domain tree, looks at the group for node0, looks at both
> cpu0 and cpu1, finds node0 not overloaded because cpu0 is idle,
> and decides to keep the thread on the local node. Next it walks down
> the tree and finds it isn't allowed to use the idle cpu0. So both
> threads get scheduled to cpu1. Oops.
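
As an aside, the "taskset 0x0a" above restricts the process to cpus 1
and 3.  The same affinity can be set programmatically; a minimal
userspace sketch using the sched_setaffinity(2) interface (the function
name below is made up for illustration):

#define _GNU_SOURCE
#include <sched.h>

/* Illustrative sketch: equivalent of "taskset 0x0a", i.e. allow only
 * cpu1 (node0) and cpu3 (node1) for the calling process. */
int pin_to_cpu1_and_cpu3(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(1, &mask);	/* bit 0x02 */
	CPU_SET(3, &mask);	/* bit 0x08; together 0x0a */

	/* pid 0 means the calling process */
	return sched_setaffinity(0, sizeof(mask), &mask);
}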

I don't think the problem is in sched_balance_self(). It is probably doing
the right thing based on the load present at the time of fork/exec. Once
node-1 becomes idle, we expect the two threads on node-0's cpu-1 to get
distributed between the two nodes.
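
For reference, the group selection in that fork/exec path
(sched_balance_self() -> find_idlest_group()) looks roughly like the
sketch below.  This is a simplified rendering, not the actual
kernel/sched.c code; cpu_load() is a made-up stand-in for the real
source_load()/target_load() helpers:

struct sched_group *find_idlest_group_sketch(struct sched_domain *sd,
					     struct task_struct *p)
{
	struct sched_group *group = sd->groups, *idlest = NULL;
	unsigned long min_load = ULONG_MAX;

	do {
		unsigned long load = 0;
		int i;

		/*
		 * The group load is summed over every cpu in the
		 * group, idle cpus included, without consulting
		 * p->cpus_allowed.  The affinity mask is honored
		 * only afterwards, when a cpu within the chosen
		 * group is picked.
		 * (cpu_load() here is a stand-in, not a real helper.)
		 */
		for_each_cpu_mask(i, group->cpumask)
			load += cpu_load(i);

		if (load < min_load) {
			min_load = load;
			idlest = group;
		}
		group = group->next;
	} while (group != sd->groups);

	return idlest;
}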

Perhaps the real issue is how cpu_power is calculated for the node domain
on these systems. Because of the shared resources between the cpus in a
node, cpu_power for a group in the node domain should be less than
2 * SCHED_LOAD_SCALE.

Once that is the case, find_busiest_group() should detect the imbalance and
move one of the threads from cpu-1 (node-0) to cpu-3 (node-1).
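
To make that concrete with hypothetical numbers (SCHED_LOAD_SCALE = 128,
each thread contributing a load of 128, both threads sitting on cpu-1,
node-1 completely idle), the per-group load normalization in
find_busiest_group() works out roughly to:

	node-0 group load = 2 * 128 = 256

	cpu_power = 256 (2 * SCHED_LOAD_SCALE):
		avg_load = 256 * 128 / 256 = 128
		-> exactly one cpu's worth; nothing looks wrong

	cpu_power = 192 (< 2 * SCHED_LOAD_SCALE):
		avg_load = 256 * 128 / 192 ~= 170
		-> well above SCHED_LOAD_SCALE, an imbalance against
		   the idle node-1 group, so one thread gets pulled

The value 192 is only illustrative; the point is that discounting the
group's capacity below 2 * SCHED_LOAD_SCALE makes the doubly-loaded node
look overloaded relative to the idle one.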

> The patch attached takes the sledgehammer approach to fix it: In case
> we have a non-default cpumask in task->cpus_allowed, the scheduler
> ignores all the fancy scheduling domains and simply spreads the load
> equally over the cpus allowed by task->cpus_allowed. Not exactly
> elegant, but it works. Not every time, but very often.
>
> Comments? Ideas on how to solve this better? I've also tried to play
> with the group load calculation, but it didn't work well. I'm kinda
> lost in all those scheduler tuning knobs ...
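
The sledgehammer described above is presumably of roughly this shape (a
hypothetical sketch, not the actual patch; cpu_load() is again a
stand-in for the real load helpers): when the task has a restricted
task->cpus_allowed mask, bypass the domain tree and pick the
least-loaded allowed cpu directly.

static int sledgehammer_select_cpu(struct task_struct *p)
{
	unsigned long min_load = ULONG_MAX;
	int cpu, best = task_cpu(p);

	/* Ignore the scheduling domains entirely and simply spread
	 * the load over the cpus permitted by the affinity mask. */
	for_each_cpu_mask(cpu, p->cpus_allowed) {
		unsigned long load = cpu_load(cpu);	/* stand-in */

		if (load < min_load) {
			min_load = load;
			best = cpu;
		}
	}
	return best;
}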

In my opinion, this patch is not the correct fix for the issue.

thanks,
suresh