Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

From: Ingo Molnar
Date: Thu Sep 27 2012 - 01:47:40 EST



* Mike Galbraith <efault@xxxxxx> wrote:

> I think the pgbench problem is more about latency for the 1 in
> 1:N than spinlocks.

So my understanding of the psql workload is that basically we've
got a central psql proxy process that is distributing work to
worker psql processes. If a freshly woken worker process ever
preempts the central proxy process then it is preventing a lot
of new work from getting distributed.

Correct?

So the central proxy psql process is 'much more important' to
run than any of the worker processes - an importance that is not
(currently) visible from the behavioral statistics the scheduler
keeps on tasks.

So the scheduler has the following problem here: a new wakee
might be starved enough and the proxy might have run long enough
to really justify the preemption here and now. The buddy
statistics help avoid some of these cases - but not all and the
difference is measurable.

Yet the 'best' way for psql to run is for this proxy process to
never be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get
more aggressive in selecting a remote CPU then we, as a side
effect, also reduce the chance of harmful preemption of the
central proxy psql process.

So in that sense sibling selection is somewhat of an indirect
red herring: it really only helps psql indirectly by preventing
the harmful preemption. It also, somewhat paradoxically, argues
for suboptimal code: for example tearing apart buddies is
beneficial in the psql workload, because it also allows the more
important part of the buddy to run more (the proxy).

In that sense the *real* problem isn't even parallelism (although
we obviously should improve the decisions there - and the logic
has suffered in the past from the psql dilemma outlined above),
but whether the scheduler can (and should) identify the central
proxy and keep it running as much as possible, deprioritizing
fairness, wakeup buddies, runtime overlap and cache affinity
considerations.

There are two broad solutions that I can see:

- Add a kernel solution to somehow identify 'central' processes
and bias them. Xorg is a similar kind of process, so it would
help other workloads as well. That way lie dragons, but might
be worth an attempt or two. We already maintain a couple of
robust metrics, such as overlap statistics to identify buddies.

- Let user-space occasionally identify its important (and less
important) tasks - say psql could mark its worker processes as
SCHED_BATCH and keep its central process(es) higher prio. A
single line of obvious code in 100 KLOCs of user-space code.
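
To make the user-space option concrete, here is a minimal sketch of
what that "single line" could look like on Linux. The helper name
mark_self_as_batch() is hypothetical and not taken from any existing
code base; the only real interface used is sched_setscheduler(2) with
the SCHED_BATCH policy, which requires a static priority of 0.

```c
#define _GNU_SOURCE		/* exposes SCHED_BATCH in <sched.h> on glibc */
#include <sched.h>
#include <stdio.h>

/*
 * Hypothetical helper: switch the calling process to SCHED_BATCH.
 * A pid of 0 means "the caller". Returns 0 on success, -1 on failure.
 */
static int mark_self_as_batch(void)
{
	/* SCHED_BATCH, like SCHED_OTHER, only accepts priority 0. */
	struct sched_param param = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_BATCH, &param) != 0) {
		perror("sched_setscheduler(SCHED_BATCH)");
		return -1;
	}
	return 0;
}
```

A worker would call this once, early in its own startup path (after
fork(), since the scheduling policy is inherited by children), while
the central process simply stays SCHED_OTHER. Where exactly that call
would belong in a given application is an assumption left open here.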

Just to confirm, if you turn off all preemption via a hack
(basically if you turn SCHED_OTHER into SCHED_BATCH), does psql
perform and scale much better, with the quality of sibling
selection and spreading of processes only being a secondary
effect?

Thanks,

Ingo