Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

From: Borislav Petkov
Date: Wed Sep 26 2012 - 17:37:33 EST


On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
> I'm *so* not surprised.
>
> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
>
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.
>
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
>
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a process,
because, for example, both working sets fit in the L2 and/or there's
some sharing which saves you moving everything over the L3. This is
where selecting a core on the same L2 is actually a good thing.

Or the working sets are too big to fit into the L2 and they start kicking
each other out. Then you want to spread them out to different L2s - i.e.,
different HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's patch
hurts us.

Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the
workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their
timeslices with the perf counters to know exactly what type of workload
we're looking at. I.e., do I have a large number of L2 evictions? Yes,
then spread them out. No, then select the other core on the L2. And so
on.
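
To make that a bit more concrete, here is a toy userspace sketch of the
decision, not kernel code. The enum, the function name and the cutoff of
5 evictions per 1000 instructions are all made up for illustration; a real
version would have to get the counts from the PMU and tune the threshold
per microarchitecture.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical placement decisions; these are not kernel identifiers. */
enum placement { PLACE_SHARE_L2, PLACE_SPREAD_L2 };

/*
 * Assumed cutoff: more than 5 L2 evictions per 1000 retired instructions
 * counts as "kicking each other out". The number is invented.
 */
#define L2_EVICT_PER_KINSN_MAX 5

/*
 * Decide from the counts sampled over one timeslice whether a task
 * should share an L2 with its partner or be spread to a different L2.
 */
enum placement choose_placement(uint64_t l2_evictions, uint64_t instructions)
{
	assert(instructions > 0);	/* a sampled slice retired something */

	/* Compare evictions/instructions with MAX/1000 without dividing. */
	if (l2_evictions * 1000 > (uint64_t)L2_EVICT_PER_KINSN_MAX * instructions)
		return PLACE_SPREAD_L2;

	return PLACE_SHARE_L2;
}
```

The multiplication instead of a division is only there to keep everything
in integers, which is what you would want in a scheduler path anyway.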

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it
> finds an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.
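
For reference, the "closest first" search being argued for could look
roughly like the toy userspace model below. It is not
select_idle_sibling() and it is not the RFC patch; struct toy_cpu,
find_idle_near() and the fixed 8-CPU layout are assumptions made up for
the example, and it ignores everything the L2-thrashing case above cares
about.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Toy topology: which L2 and which L3 (LLC) each CPU sits under. */
struct toy_cpu {
	int l2_id;
	int l3_id;
	bool idle;
};

/*
 * Return an idle CPU as close to target as possible: target itself,
 * then a CPU sharing its L2, then one sharing only its L3. Fall back
 * to target when nothing under the same L3 is idle.
 */
int find_idle_near(const struct toy_cpu *cpus, int target)
{
	int cpu;

	assert(target >= 0 && target < NR_CPUS);

	if (cpus[target].idle)
		return target;

	/* Innermost level first: siblings on the same L2. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != target && cpus[cpu].idle &&
		    cpus[cpu].l2_id == cpus[target].l2_id)
			return cpu;

	/* Only then widen the search to the rest of the L3. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpus[cpu].idle && cpus[cpu].l3_id == cpus[target].l3_id)
			return cpu;

	return target;
}
```

With two CPUs per L2 and four per L3, a busy target with an idle L2
sibling always gets that sibling, even if a lower-numbered CPU elsewhere
on the L3 is idle too.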

> So once again, we should start at the inner level and if we can't find
> something really close, we work our way out, rather than starting from
> the outer level and working our way in.
>
> If I read the code correctly, we can have both "prev" and "cpu" in
> the same L2 domain, but because we start looking at the L3 domain, we
> may end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses "cpu_idle(target)", but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?
>
> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.
>
> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, with a fresh head. Too late here now.

Thanks.

--
Regards/Gruss,
Boris.