Re: [patch] Re: scheduler oddity [bug?]

From: Mike Galbraith
Date: Mon Mar 09 2009 - 11:31:18 EST


On Mon, 2009-03-09 at 15:41 +0100, Peter Zijlstra wrote:
> On Mon, 2009-03-09 at 15:11 +0100, Mike Galbraith wrote:
>
> > > Yes 2* worked fine. Mysql+oltp was my worry spot, being a very affinity
> > > sensitive little <bleep>, but my patchlet didn't cause any trouble, so
> > > this one shouldn't either. I'll do some re-test in any case, and squeak
> > > should anything turn up.
> >
> > Squeak! Didn't even get to mysql+oltp.
> >
> > marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> >
> > 65536    4096    60.00     5161103      0      2818.65
> > 65536            60.00     5149666             2812.40
> >
> >  6188 root      20   0  1040  544  324 R  100  0.0   0:31.49  0 netperf
> >  6189 root      20   0  1044  260  164 R   48  0.0   0:15.35  3 netserver
> >
> > Hurt, pain, ouch, vs...
> >
> > marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,0 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
> > UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> >
> > 65536    4096    60.00     8452028      0      4615.93
> > 65536            60.00     8442945             4610.97
> >
> > Drat.
>
> Bugger, so back to the drawing board it is...

Hm.

CPU-utilization-wise, this test is similar to pipetest; the major
difference is chunk size. Netperf is waking and being preempted (if on
the same CPU) at a very high rate, so the hog component gets the CPU in
tiny chunks, vs the hefty chunks pipetest hands out.
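
For reference, the "chunk" here is the slice-local runtime CFS already
tracks in struct sched_entity; a minimal sketch, with the helper name
invented for illustration:

	/*
	 * Sketch only: the per-slice CPU chunk referred to above.
	 * prev_sum_exec_runtime is snapshotted when a task is picked
	 * to run, so the difference is how much CPU the task got in
	 * its current stint on the CPU.
	 */
	static u64 slice_runtime(struct sched_entity *se)
	{
		return se->sum_exec_runtime - se->prev_sum_exec_runtime;
	}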

Simply doing the below (it will look very familiar) made both netperf
and pipetest happy again, precisely because of that preemption rate:
both start life wanting to be affine, and due to the switch rate,
pipetest becomes non-affine while netperf remains affine.
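
For context: avg_overlap is a plain EWMA, nudged along by the existing
update_avg() helper in kernel/sched.c, and the affine-wakeup path
compares it against sysctl_sched_migration_cost, so keeping the average
fresh across preemptions is what keeps netperf affine:

	/*
	 * Existing helper in kernel/sched.c: an exponentially
	 * weighted moving average, 1/8th weight per new sample.
	 */
	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		*avg += diff >> 3;
	}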

Maybe we should factor in wakeup rate, and whether we're waking many
tasks vs one. Wakeups are tied to data, so there's a correlation to
potential cache-miss pain, no?
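
Purely hypothetical sketch of what I mean, with the fields and helper
invented for illustration: sample the inter-wakeup interval with the
same EWMA, so a high-frequency waker reads as cache-hot:

	/*
	 * Hypothetical, not existing code: track the interval between
	 * wakeups a task issues.  A tiny average interval means the
	 * pair is exchanging data at a high rate, so migrating the
	 * wakee off the shared cache likely buys cache-miss pain.
	 */
	static void note_wakeup(struct sched_entity *se, u64 now)
	{
		update_avg(&se->avg_wakeup_interval, now - se->last_wakeup_stamp);
		se->last_wakeup_stamp = now;
	}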

There is also evidence that your patch did in fact make the right
decision, but that we really REALLY should try to punt to a CPU that
shares a cache, if one is available. Check out the numbers when the
netperf test runs on two CPUs that share a cache.

marge:..local/tmp # netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -T 0,1 -- -P 15888,12384 -s 32768 -S 32768 -m 4096
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15888 AF_INET to 127.0.0.1 (127.0.0.1) port 12384 AF_INET : cpu bind
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

65536    4096    60.00    15325632      0      8369.84
65536            60.00    15321176             8367.40
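
A hypothetical sketch of that "punt to a cache sibling" fallback,
assuming cpu_coregroup_mask() hands back the LLC siblings (the helper
name is invented):

	/*
	 * Hypothetical: when the affine wakeup loses, prefer an idle
	 * CPU sharing a cache with the waker before falling back to
	 * a remote package.
	 */
	static int select_cache_sibling(int cpu)
	{
		int i;

		for_each_cpu(i, cpu_coregroup_mask(cpu)) {
			if (i != cpu && idle_cpu(i))
				return i;
		}

		return cpu;
	}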

(You can skip the below, nothing new there. Just for completeness;)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8e2558c..0f67b2a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4508,6 +4508,24 @@ static inline void schedule_debug(struct task_struct *prev)
 #endif
 }
 
+static void put_prev_task(struct rq *rq, struct task_struct *prev)
+{
+	if (prev->state == TASK_RUNNING) {
+		u64 runtime = prev->se.sum_exec_runtime;
+
+		runtime -= prev->se.prev_sum_exec_runtime;
+		runtime = min_t(u64, runtime, 2*sysctl_sched_migration_cost);
+
+		/*
+		 * In order to avoid avg_overlap growing stale when we are
+		 * indeed overlapping and hence not getting put to sleep, grow
+		 * the avg_overlap on preemption.
+		 */
+		update_avg(&prev->se.avg_overlap, runtime);
+	}
+	prev->sched_class->put_prev_task(rq, prev);
+}
+
 /*
  * Pick up the highest-prio task:
  */
@@ -4586,7 +4604,7 @@ need_resched_nonpreemptible:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
-	prev->sched_class->put_prev_task(rq, prev);
+	put_prev_task(rq, prev);
 	next = pick_next_task(rq, prev);
 
 	if (likely(prev != next)) {

