Re: VolanoMark regression with 2.6.27-rc1

From: Zhang, Yanmin
Date: Thu Aug 28 2008 - 23:34:53 EST



On Thu, 2008-08-21 at 14:48 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-08-21 at 08:16 +0200, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
> >
> > > > ok, i've applied this one to tip/sched/urgent instead of the
> > > > feature-disabling patchlet. Yanmin, could you please check whether this
> > > > one does the trick?
> > >
> > > This new patch almost doesn't help volanoMark. Pls. use the patch
> > > which sets LB_BIAS=1 by default.
> >
> > ok. That also removes the kernel.h complications ;-)
> Sorry, I have a new update.
> Originally, I worked on 2.6.27-rc1. I have just moved to 2.6.27-rc3 and found
> something different when CONFIG_GROUP_SCHED=n.
>
> With 2.6.27-rc3, on my 8-core Stoakley, the volanoMark regression disappears
> completely, whether or not I enable LB_BIAS. On the 16-core Tigerton, the
> regression is still there if I don't enable LB_BIAS, but it drops from 65% to 11%.
I have new updates on this regression. I checked the volanoMark web page and
found that the client command line has two options, rooms and users. rooms sets how
many chat rooms are started; users sets how many users are in one room. The defaults
are rooms=10 and users=20, so every room has about 800 threads. As all threads of a
room communicate only within that room, the number of rooms is important.

All my previous volanoMark testing used the default of 10 rooms and 20 users. With wake_affine
in the kernel, waker and sleeper are gradually moved to the same CPU. However, if the number of
rooms is not a multiple of the number of CPUs, load balancing makes the kernel move threads from
one CPU to another continually. When there are so many threads that the cache-hot effect is
weakened, load balancing matters more. But when there are not too many running threads, keeping
caches hot matters more than load balancing. Should we favor wake_affine more?

Below is some data I collected from numerous test runs on 3 machines.


On a Stoakley with 2 quad-core processors (8 cores):
kernel\rooms       |      8 |     10 |     16 |     32
------------------------------------------------------
2.6.26_nogroup     | 385617 | 351247 | 323324 | 231934
------------------------------------------------------
2.6.27-rc4_nogroup | 359124 | 336984 | 335180 | 235258
------------------------------------------------------
2.6.26_group       | 381425 | 343636 | 312280 | 179673
------------------------------------------------------
2.6.27-rc4_group   | 212112 | 270000 | 300188 | 228465
------------------------------------------------------

On a new x86_64 machine with 2 quad-core processors and HT (8 cores, 16 hardware threads):
kernel\rooms       |     10 |     16 |     24 |     32 |     64
---------------------------------------------------------------
2.6.26_nogroup     | 667668 | 671860 | 671662 | 621900 | 509482
---------------------------------------------------------------
2.6.27-rc4_nogroup | 732346 | 800290 | 709272 | 648561 | 497243
---------------------------------------------------------------
2.6.26_group       | 705579 | 759464 | 693697 | 636019 | 500744
---------------------------------------------------------------
2.6.27-rc4_group   | 572426 | 674977 | 627410 | 590984 | 445651
---------------------------------------------------------------

On a Tigerton with 4 quad-core processors (16 cores; the 32-room test isn't stable on this machine, so there is no 32 column):
kernel\rooms       |      8 |     10 |     16
---------------------------------------------
2.6.26_nogroup     | 346410 | 382938 | 349405
---------------------------------------------
2.6.27-rc4_nogroup | 359124 | 336984 | 335180
---------------------------------------------
2.6.26_group       | 504802 | 376513 | 319020
---------------------------------------------
2.6.27-rc4_group   | 247652 | 284784 | 355132
---------------------------------------------

I also tried different user counts with 8 rooms and found that the results for 20, 40 and 60 users are very close.

With group scheduling, 2.6.26 is mostly better than 2.6.27-rc4.
Without group scheduling, the result depends on the specific machine.

I also reran hackbench with 10, 16 and 32 groups and found that the difference between the 2 kernels
varies with the group count.

What are the most reasonable group/rooms values to test with?

On the other hand, tbench (started with CPU_NUM*2 clients) shows about a 4~5% regression with 2.6.27-rc kernels.
From 30 seconds of schedstat data collected during the test, I found almost no remote wakeups or
wake_affine with 2.6.26, but many of both with 2.6.27-rc.

-yanmin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/