Re: High priority tasks break SMP balancer?

From: Micah Dowty
Date: Mon Nov 19 2007 - 13:51:28 EST


On Sat, Nov 17, 2007 at 08:10:35PM +0100, Dmitry Adamushko wrote:
> Micah,
>
> ok, would it be possible to get "cat /proc/schedstat" output at the
> moment when you observe the 'problem'? So we could try to analyze
> behavior of the load balancer (yeah, we should have probably started
> with this step)

No problem. Here are the two schedstat snapshots. With my priosched.c
test program running with one CPU idle, I did a "cat /proc/schedstat",
waited a couple seconds, then repeated:

micah@micah-64:~$ cat /proc/schedstat
version 14
timestamp 4366722208
cpu0 0 0 0 785868111 0 828020771 12621812 22703872 14096041 9504937730116 23603703739504 815398959
domain0 03 8766051 8760123 95 14544377 6374 0 13 8760110 7369 6771 0 20835105 605 0 1 6770 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7026998 0 24826
cpu1 0 0 0 783505304 0 860700497 26129151 41776419 34749418 10682815341474 1753703225423 834571346
domain0 03 8685595 8677710 224 7037051 8127 0 76 8677634 7500 7320 7 471762 179 0 8 7312 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 8607831 0 36753
micah@micah-64:~$ cat /proc/schedstat
version 14
timestamp 4366723231
cpu0 0 0 0 785868111 0 828021128 12621812 22703965 14096130 9509029340381 23603703795868 815399316
domain0 03 8766051 8760123 95 14544377 6374 0 13 8760110 7374 6776 0 20835105 605 0 1 6775 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7027010 0 24826
cpu1 0 0 0 783505304 0 860708521 26132612 41780521 34753508 10683770853336 1753758309667 834575909
domain0 03 8685718 8677833 224 7037051 8127 0 76 8677757 7500 7320 7 471762 179 0 8 7312 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 8607835 0 36753

I had been experimenting with both schedstat and sched_debug while
trying to identify the original problem I observed with VMware's
code. During the course of that investigation, I wrote a little tool
to graph schedstat output in real-time.

The tool itself is in a public Subversion reporitory:
http://svn.navi.cx/misc/trunk/rtgraph/schedstat-grapher.py
(Requires the other Python modules in that directory, and PyGTK)

This screenshot was taken just after the priosched.c test program
flipped from the "bad" state to the "good" state. It had been running
on one CPU for a while, and something caused it to rebalance properly
onto both CPUs while this graph was being recorded. You can see that
in the "bad" state, cpu0 was trying to rebalance but it was failing
because there was no busier group (idle_nobusyg):

http://navi.cx/~micah/priosched-graph-1.png

While in the "bad" state, the sched_debug output shows clearly why we
were hitting idle_nobusyg. One CPU has both busy-loop threads in its
runqueue, the other CPU is mostly idle but it has an elevated CPU load
proportional to the high priority thread's priority boost.

micah@micah-64:~$ cat /proc/sched_debug
Sched Debug Version: v0.05-v20, 2.6.23.1 #1
now at 287991360677873 nsecs

cpu#0, 2593.095 MHz
.nr_running : 0
.load : 0
.ls.delta_fair : 27950
.ls.delta_exec : 1958562
.nr_switches : 42562597
.nr_load_updates : 71998668
.nr_uninterruptible : 2731
.jiffies : 4366886827
.next_balance : 4366886830
.curr->pid : 0
.clock : 288012673563036
.idle_clock : 0
.prev_clock_raw : 287377858327690
.clock_warps : 0
.clock_overflows : 50589
.clock_deep_idle_events : 0
.clock_max_delta : 4000250
.cpu_load[0] : 58784
.cpu_load[1] : 64246
.cpu_load[2] : 65333
.cpu_load[3] : 63586
.cpu_load[4] : 61992

cfs_rq
.fair_clock : 9034352848909
.exec_clock : 10184480087500
.wait_runtime : 0
.wait_runtime_overruns : 10356815
.wait_runtime_underruns : 1551195
.sleeper_bonus : 33752273
.wait_runtime_rq_sum : 0

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------

cpu#1, 2593.095 MHz
.nr_running : 3
.load : 3072
.ls.delta_fair : 247507
.ls.delta_exec : 1148874
.nr_switches : 78415958
.nr_load_updates : 71998585
.nr_uninterruptible : -2731
.jiffies : 4366886827
.next_balance : 4366886976
.curr->pid : 16252
.clock : 288012340795477
.idle_clock : 0
.prev_clock_raw : 287377857592876
.clock_warps : 0
.clock_overflows : 18670
.clock_deep_idle_events : 0
.clock_max_delta : 4000250
.cpu_load[0] : 7136
.cpu_load[1] : 4591
.cpu_load[2] : 3319
.cpu_load[3] : 2691
.cpu_load[4] : 2391

cfs_rq
.fair_clock : 10089416808778
.exec_clock : 11099173384388
.wait_runtime : -74868320
.wait_runtime_overruns : 18282511
.wait_runtime_underruns : 1233484
.sleeper_bonus : 3602202
.wait_runtime_rq_sum : -74868320

runnable tasks:
task PID tree-key delta waiting switches prio sum-exec sum-wait sum-sleep wait-overrun wait-underrun
------------------------------------------------------------------------------------------------------------------------------------------------------------------
priosched 16248 10089447384328 30575550 -31397913 172 120 1665860833 -80591108 0 0 17
priosched 16249 10089453683739 36874961 -39698014 148 120 1670430581 -63070307 0 0 13
R cat 16252 10089420581171 3772393 -3772393 1 120 73354 -22393 90430 0 0


Thanks,
--Micah
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/