Re: [patch 2/2] sched: Scale the nohz_tracker logic by making itper NUMA node

From: Peter Zijlstra
Date: Mon Dec 21 2009 - 08:12:44 EST


On Thu, 2009-12-10 at 17:27 -0800, venkatesh.pallipadi@xxxxxxxxx wrote:
> plain text document attachment
> (0002-sched-Scale-the-nohz_tracker-logic-by-making-it-per.patch)
> Having one idle CPU doing the rebalancing for all the idle CPUs in
> nohz mode does not scale well with increasing number of cores and
> sockets. Make the nohz_tracker per NUMA node. This results in multiple
> idle load balancing happening at NUMA node level and idle load balancer
> only does the rebalance domain among all the other nohz CPUs in that
> NUMA node.
>
> This addresses the below problem with the current nohz ilb logic
> * The lone balancer may end up spending a lot of time doing the
> * balancing on
> behalf of nohz CPUs, especially with increasing number of sockets and
> cores in the platform.

Right, so I think the whole NODE idea here is wrong, it all seems to
work out properly if you simply pick one sched domain larger than the
one that contains all of the current socket and contains an idle unit.

Except that the sched domain stuff is not properly aware of bigger
topology things atm.

The sched domain tree should not view node as the largest structure and
we should remove that current random node split crap we have.

Instead the sched domains should continue to express the topology, like
nodes within 1 hop, nodes within 2 hops, etc.

Then this nohz idle balancing should pick the socket level (which might
be larger than the node level), and walks up the domain tree, until we
reach a level where it has a whole idle group.

This means that we'll always span at least 2 sockets, which means we'll
gracefully deal with the overload scenario.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/