Re: newidle balancing in NUMA domain?

From: Ingo Molnar
Date: Mon Nov 23 2009 - 07:09:03 EST



* Nick Piggin <npiggin@xxxxxxx> wrote:

> On Mon, Nov 23, 2009 at 12:45:50PM +0100, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > > On Mon, 2009-11-23 at 12:22 +0100, Nick Piggin wrote:
> > > > Hi,
> > > >
> > > > I wonder why it was decided to do newidle balancing in the NUMA
> > > > domain? And with newidle_idx == 0 at that.
> > > >
> > > > This means that every time the CPU goes idle, every CPU in the
> > > > system gets a remote cacheline or two hit. Not very nice O(n^2)
> > > > behaviour on the interconnect. Not to mention trashing our
> > > > NUMA locality.
> > > >
> > > > And then I see some proposal to do ratelimiting of newidle
> > > > balancing :( Seems like hack upon hack making behaviour much more
> > > > complex.
> > > >
> > > > One "symptom" of bad mutex contention can be that increasing the
> > > > balancing rate can help a bit to reduce idle time (because it
> > > > can get the woken thread which is holding a semaphore to run ASAP
> > > > after we run out of runnable tasks in the system due to them
> > > > hitting contention on that semaphore).
> > > >
> > > > I really hope this change wasn't done in order to help -rt or
> > > > something sad like sysbench on MySQL.
> > >
> > > IIRC this was kbuild and other spreading workloads that want this.
> > >
> > > the newidle_idx=0 thing is because I frequently saw it make funny
> > > balance decisions based on old load numbers, like f_b_g() selecting a
> > > group that didn't even have tasks in anymore.
> > >
> > > We went without newidle for a while, but then people started
> > > complaining about that kbuild time, and there is an x264 encoder thing
> > > that loses tons of throughput.
> >
> > Yep, i too reacted in a similar way to Nick initially - but i think you
> > are right, we really want good, precise metrics and want to be
> > optional/fuzzy in our balancing _decisions_, not in our metrics.
>
> Well to be fair, the *decision* is to use a longer-term weight for the
> runqueue to reduce balancing (seeing as we naturally do far more
> balancing on conditions means that we tend to look at our instant
> runqueue weight when it is 0).

Well, the problem with that is that it uses a potentially outdated
metric - and that can become visible if balancing events are rare
enough.
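
To make the staleness concrete, here is a minimal sketch (not the
actual kernel code). It assumes a decayed load average in the style of
the per-index runqueue load, where index 0 is the instantaneous load
and higher indices blend in older samples. The helper names
decayed_load_step() and decayed_load_after() are hypothetical, as is
the exact decay formula.

```c
/*
 * Hypothetical helper: one refresh step of a decayed load average.
 * idx == 0 yields the instantaneous load; a larger idx keeps more of
 * the old value. This formula is an assumption for illustration.
 */
static unsigned long decayed_load_step(unsigned long old_load,
                                       unsigned long cur_load,
                                       unsigned int idx)
{
	unsigned long scale = 1UL << idx;

	return (old_load * (scale - 1) + cur_load) >> idx;
}

/*
 * Hypothetical helper: apply 'updates' refresh steps while the
 * runqueue's real load stays at cur_load.
 */
static unsigned long decayed_load_after(unsigned long start_load,
                                        unsigned long cur_load,
                                        unsigned int idx,
                                        unsigned int updates)
{
	while (updates--)
		start_load = decayed_load_step(start_load, cur_load, idx);

	return start_load;
}
```

With a starting load of 1024 and a runqueue that has since gone empty,
idx 0 reports 0 after a single refresh, while idx 2 still reports a
substantial load after two refreshes. If refreshes are rare, a
balancer reading the idx 2 value can pick a group that no longer has
any tasks.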

I.e. we do need a time scale (rate of balancing) to be able to do this
correctly on a statistical level - which pretty much brings in 'rate
limit' kind of logic.

We are better off observing reality precisely and then saying "don't do
this action" instead of fuzzing our metrics [or using fuzzy metrics
conditionally - which is really the same] and hoping that in the end it
will be as if we didn't make certain decisions.
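
As a rough sketch of that split (purely illustrative, not the
scheduler's code): the metric is read exactly, and the "dont do this
action" part lives in a separate rate-limit check. The struct
newidle_gate, its fields and should_newidle_balance() are hypothetical
names, and the time unit is an assumed tick counter.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for rate limiting newidle balancing. */
struct newidle_gate {
	unsigned long last_balance;	/* tick of the last allowed attempt */
	unsigned long min_interval;	/* minimum ticks between attempts */
};

/*
 * Decide whether to attempt a newidle balance.
 * The loads passed in are assumed to be precise, instantaneous values;
 * all the fuzziness is in the decision, not in the metric.
 */
static bool should_newidle_balance(struct newidle_gate *gate,
				   unsigned long now,
				   unsigned long this_load,
				   unsigned long busiest_load)
{
	/* Precise metric: nothing to pull if nobody is busier than us. */
	if (busiest_load <= this_load)
		return false;

	/* Decision: skip the action if we balanced too recently. */
	if (now - gate->last_balance < gate->min_interval)
		return false;

	gate->last_balance = now;
	return true;
}
```

The point of the split is that tuning min_interval changes how often we
act without changing what we believe about the runqueues.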

(I hope i explained my point clearly enough.)

No argument that it could be done more cleanly - the current duality of
having both the fuzzy stats and the rate limiting should be resolved
one way or another.

Also, no argument that if you can measure bad effects from this change
on any workload we need to look at that and fix it.

Thanks,

Ingo