Microbenchmarking userspace synchronisation primitives

From: Brice Arnould
Date: Wed Jan 05 2011 - 18:31:24 EST


Hello,

I was microbenchmarking synchronisation primitives, and got very
surprising results. Among other things, it seems to be cheaper to
synchronize two processes (with a pipe) than it is to synchronize two
threads of the same process. Also, the benchmark runs faster when it is
constrained to a single CPU with sched_setaffinity(), which I think
might indicate suboptimal scheduling.
The experiment is as follows: a first process (or thread) transmits an
integer to a second process (or thread) via a pipe (or via shared
memory). The second process (or thread) answers back with the integer
multiplied by two. The first process (or thread) then checks that the
multiplication was correct.
A single run of this experiment is called an exchange; the
microbenchmark consists of many such exchanges.
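
In the pipe case, a single exchange boils down to something like the
sketch below (a simplification; the benchmark code linked further down
loops over many exchanges and also has a threaded variant):

/* One pipe-based exchange, most error handling elided: the parent
 * sends an integer to the child, the child answers with twice that
 * value, and the parent checks the answer. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int to_child[2], to_parent[2];
	int sent = 21, received = 0;

	if (pipe(to_child) == -1 || pipe(to_parent) == -1) {
		perror("pipe");
		return EXIT_FAILURE;
	}
	if (fork() == 0) {
		int n;
		read(to_child[0], &n, sizeof n);   /* wait for the question */
		n *= 2;
		write(to_parent[1], &n, sizeof n); /* send back the answer  */
		_exit(EXIT_SUCCESS);
	}
	write(to_child[1], &sent, sizeof sent);
	read(to_parent[0], &received, sizeof received);
	if (received != 2 * sent) {
		fprintf(stderr, "bad answer: %d\n", received);
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}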

The average number of exchanges per second is as follows:
- Processes synchronized by a pipe : 059171 exchanges/second
- Threads synchronized by a pipe : 031746 exchanges/second
- (Threads synchronized by a pthread_barrier: 017825 exchanges/second)

Kernel : 2.6.35-24-generic #42-Ubuntu SMP
CPU : Intel Core 2 Duo CPU T8100 @ 2.10GHz (in 32 bit mode)
Code : git://github.com/unbrice/20110103_lkml_bench.git
Build with "make"
Run like "./rpc-piped_thread 1000000 1"
or "./rpc-piped_process 1000000 1"
libc : Glibc 2.12.1-0ubuntu10

I was expecting performance to vary across the different
synchronization primitives, and it seems plausible that
pthread_barrier, being more "generic" (it can synchronize an arbitrary
number of threads), should be slower.
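
For reference, the barrier variant does one exchange roughly as in the
sketch below (again a simplification; the real benchmark loops over
many exchanges):

/* One barrier-synchronized exchange between two threads sharing an
 * integer: the main thread sets the value, the worker doubles it, and
 * the two barrier waits order the accesses.  Build with -pthread. */
#include <pthread.h>
#include <stdio.h>

static pthread_barrier_t barrier;
static int value;

static void *worker(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&barrier); /* wait until the value is set  */
	value *= 2;                     /* answer with twice the value  */
	pthread_barrier_wait(&barrier); /* signal the answer is ready   */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_barrier_init(&barrier, NULL, 2);
	pthread_create(&tid, NULL, worker, NULL);
	value = 21;                     /* transmit the integer         */
	pthread_barrier_wait(&barrier);
	pthread_barrier_wait(&barrier); /* wait for the doubled value   */
	if (value != 42)
		fprintf(stderr, "bad answer: %d\n", value);
	pthread_join(tid, NULL);
	pthread_barrier_destroy(&barrier);
	return 0;
}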
What surprised me, however, was that synchronizing two threads of a
single process via a pipe is slower than synchronizing two separate
processes via a pipe. If I'm not misinterpreting perf(1), almost all
of the time is spent in try_to_wake_up, so the cause might be there.
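The profile comes from something along these lines:

perf record -g ./rpc-piped_thread 1000000 1
perf report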

While trying to find an explanation for this behavior, I forced the
threads and processes to run on a CPU of my choice using
sched_setaffinity().
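The pinning itself is the usual cpu_set_t dance, roughly like this
(a sketch; pin_to_cpu is just a name I use here):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread (pid 0) onto a single CPU. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof set, &set) == -1) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}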
When I forced the threads and processes onto different CPUs, the
results were as before.
When I forced them onto a single CPU, performance increased and
processes stayed ahead of threads:
- Processes synchronized by a pipe : 401606 exchanges/second
- Threads synchronized by a pipe : 177304 exchanges/second
- (Threads synchronized by a pthread_barrier: 087032 exchanges/second)

If I'm not misinterpreting perf(1), those differences might be related
to a drop in the number of cache misses (from ~25k to ~13k).

Those numbers defy my understanding of what should be happening, which
is why I'm humbly asking for a hint or a pointer to an explanation.
I'm also hoping that the scheduler's possibly suboptimal behavior can
be worked on by competent gurus, though I acknowledge this may not be
worthwhile since we are only speaking of a microbenchmark.
I also expect to face a similar situation with a real application (AIs
in the context of an AI contest). Do you have any suggestions
regarding the performance of that kind of application?

Thank you in advance,
Brice