Re: Unusually high system CPU usage with recent kernels

From: Tibor Billes
Date: Wed Aug 21 2013 - 17:06:01 EST


> From: Paul E. McKenney Sent: 08/21/13 09:12 PM
> On Wed, Aug 21, 2013 at 08:14:46PM +0200, Tibor Billes wrote:
> > > From: Paul E. McKenney Sent: 08/20/13 11:43 PM
> > > On Tue, Aug 20, 2013 at 10:52:26PM +0200, Tibor Billes wrote:
> > > > > From: Paul E. McKenney Sent: 08/20/13 04:53 PM
> > > > > On Tue, Aug 20, 2013 at 08:01:28AM +0200, Tibor Billes wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I was using the 3.9.7 stable release and tried to upgrade to the 3.10.x series.
> > > > > > The 3.10.x series was showing unusually high (>75%) system CPU usage in some
> > > > > > situations, making things really slow. The latest stable I tried is 3.10.7.
> > > > > > I also tried 3.11-rc5, they both show this behaviour. This behaviour doesn't
> > > > > > show up when the system is idling, only when doing some CPU intensive work,
> > > > > > like compiling with multiple threads. Compiling with only one thread seems not
> > > > > > to trigger this behaviour.
> > > > > >
> > > > > > To be more precise I did a `perf record -a` while compiling a large C++ program
> > > > > > with scons using 4 threads, the result is appended at the end of this email.
> > > > >
> > > > > New one on me! You are running a mainstream system (x86_64), so I am
> > > > > surprised no one else noticed.
> > > > >
> > > > > Could you please send along your .config file?
> > > >
> > > > Here it is
> > >
> > > Interesting. I don't see RCU stuff all that high on the list, but
> > > the items I do see lead me to suspect RCU_FAST_NO_HZ, which has some
> > > relevance to the otherwise inexplicable group of commits you located
> > > with your bisection. Could you please rerun with CONFIG_RCU_FAST_NO_HZ=n?
> > >
> > > If that helps, there are some things I could try.
> >
> > It did help. I didn't notice anything unusual when running with CONFIG_RCU_FAST_NO_HZ=n.
>
> Interesting. Thank you for trying this -- and we at least have a
> short-term workaround for this problem. I will put a patch together
> for further investigation.

I don't specifically need this config option, so I'm fine without it in
the long term, but I guess it's not supposed to behave like that.
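
In case anyone else wants to try the same workaround, here is a minimal
sketch of the rebuild, assuming a configured kernel source tree and the
in-tree scripts/config helper:

  # check whether the running kernel has the option enabled
  grep RCU_FAST_NO_HZ /boot/config-$(uname -r)

  # disable the option, keep everything else unchanged, rebuild
  scripts/config --disable RCU_FAST_NO_HZ
  make olddefconfig
  make -j4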

> In the meantime, could you please tell me how you were measuring
> performance for your kernel builds? Wall-clock time required to complete
> one build? Number of builds completed per unit time? Something else?

Actually, I wasn't being that sophisticated. I have a system monitor
applet on my top panel (using MATE, Linux Mint), four little graphs,
one of which shows CPU usage. Different colors indicate different kinds
of CPU usage: blue shows user space, red shows system, and two more
colors show nice and iowait. During a normal compile the graph is
almost completely filled with blue user space CPU usage, with only the
top few pixels showing some iowait and system usage. With
CONFIG_RCU_FAST_NO_HZ set, about 3/4 of the graph was red system CPU
usage and the rest was blue. So I just looked for a pile of red on the
graph when I tested different kernel builds. Compile speed was also so
horrible that I couldn't wait for the build to finish, and even the UI
became unresponsive.
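
For the wall-clock numbers below, something along these lines is
enough; the exact scons invocation is an assumption, since only "scons
using 4 threads" was mentioned above:

  scons -c          # clean the build tree first
  time scons -j4    # wall-clock time of a 4-thread build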

Now I did some measuring. In the normal case the compile finished in 36
seconds, having compiled 315 object files. Here are some output lines
from dstat -tasm --vm during the compile:
----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
  time   |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | used  free| used  buff  cach  free|majpf minpf alloc  free
21-08 21:48:05| 91   8   2   0   0   0|   0  5852k|   0     0 |   0     0 |1413  1772 |   0  7934M| 581M 58.0M  602M 6553M|   0    71k   46k   54k
21-08 21:48:06| 93   6   1   0   0   0|   0  2064k| 137B  131B|   0     0 |1356  1650 |   0  7934M| 649M 58.0M  604M 6483M|   0    72k   47k   28k
21-08 21:48:07| 86  11   4   0   0   0|   0  5872k|   0     0 |   0     0 |2000  2991 |   0  7934M| 577M 58.0M  627M 6531M|   0    99k   67k   79k
21-08 21:48:08| 87   9   3   0   0   0|   0  2840k|   0     0 |   0     0 |2558  4164 |   0  7934M| 597M 58.0M  632M 6507M|   0    96k   57k   51k
21-08 21:48:09| 93   7   1   0   0   0|   0  3032k|   0     0 |   0     0 |1329  1512 |   0  7934M| 641M 58.0M  626M 6469M|   0    61k   48k   39k
21-08 21:48:10| 93   6   0   0   0   0|   0  4984k|   0     0 |   0     0 |1160  1146 |   0  7934M| 572M 58.0M  628M 6536M|   0    50k   40k   57k
21-08 21:48:11| 86   9   6   0   0   0|   0  2520k|   0     0 |   0     0 |2947  4760 |   0  7934M| 605M 58.0M  631M 6500M|   0   103k   55k   45k
21-08 21:48:12| 90   8   2   0   0   0|   0  2840k|   0     0 |   0     0 |2674  4179 |   0  7934M| 671M 58.0M  635M 6431M|   0    84k   59k   42k
21-08 21:48:13| 90   9   1   0   0   0|   0  4656k|   0     0 |   0     0 |1223  1410 |   0  7934M| 643M 58.0M  638M 6455M|   0    90k   62k   68k
21-08 21:48:14| 91   8   1   0   0   0|   0  3572k|   0     0 |   0     0 |1432  1828 |   0  7934M| 647M 58.0M  641M 6447M|   0    81k   59k   57k
21-08 21:48:15| 91   8   1   0   0   0|   0  5116k| 116B    0 |   0     0 |1194  1295 |   0  7934M| 605M 58.0M  644M 6487M|   0    69k   54k   64k
21-08 21:48:16| 87  10   3   0   0   0|   0  5140k|   0     0 |   0     0 |1761  2586 |   0  7934M| 584M 58.0M  650M 6502M|   0   105k   64k   68k

In the abnormal case only 182 object files had been compiled after six
and a half minutes, at which point I stopped the build. The same dstat
output for this case:
----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
  time   |usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | used  free| used  buff  cach  free|majpf minpf alloc  free
21-08 22:10:49| 27  62   0   0  11   0|   0     0 | 210B    0 |   0     0 |1414  3137k|   0  7934M| 531M 57.6M  595M 6611M|   0  1628  1250   322
21-08 22:10:50| 25  60   4   0  11   0|   0    88k| 126B    0 |   0     0 |1337  3110k|   0  7934M| 531M 57.6M  595M 6611M|   0    91   128   115
21-08 22:10:51| 26  63   0   0  11   0|   0   184k| 294B    0 |   0     0 |1411  3147k|   0  7934M| 531M 57.6M  595M 6611M|   0  1485   814   815
21-08 22:10:52| 26  63   0   0  11   0|   0     0 | 437B  239B|   0     0 |1355  3160k|   0  7934M| 531M 57.6M  595M 6611M|   0    24    94    97
21-08 22:10:53| 26  63   0   0  11   0|   0     0 | 168B    0 |   0     0 |1397  3155k|   0  7934M| 531M 57.6M  595M 6611M|   0   479   285   273
21-08 22:10:54| 26  63   0   0  11   0|   0  4096B| 396B  324B|   0     0 |1346  3154k|   0  7934M| 531M 57.6M  595M 6611M|   0    27   145   145
21-08 22:10:55| 26  63   0   0  11   0|   0    60k|   0     0 |   0     0 |1353  3148k|   0  7934M| 531M 57.6M  595M 6610M|   0    93   117    36
21-08 22:10:56| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1341  3172k|   0  7934M| 531M 57.6M  595M 6610M|   0   158    87    74
21-08 22:10:57| 26  62   1   0  11   0|   0     0 |  42B   60B|   0     0 |1332  3162k|   0  7934M| 531M 57.6M  595M 6610M|   0    56    82    78
21-08 22:10:58| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1334  3178k|   0  7934M| 531M 57.6M  595M 6610M|   0    26    56    56
21-08 22:10:59| 26  63   0   0  11   0|   0     0 |   0     0 |   0     0 |1336  3179k|   0  7934M| 531M 57.6M  595M 6610M|   0     3    33    32
21-08 22:11:00| 26  63   0   0  11   0|   0    24k|  90B  108B|   0     0 |1347  3172k|   0  7934M| 531M 57.6M  595M 6610M|   0    41    73    71

I have four logical cores, so 25% corresponds to one fully used core.
I don't know whether the ~26% user CPU usage has anything to do with
that or if it's just a coincidence. The rest is ~63% system and ~11%
hardware interrupt. Do these numbers support what you suspect?
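
If a per-core breakdown would help, I could capture one next time, e.g.
with mpstat from the sysstat package:

  # per-CPU utilization in 1-second samples
  mpstat -P ALL 1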

Tibor