Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

From: Chen Yu
Date: Wed May 17 2023 - 12:57:44 EST


On 2023-05-16 at 13:51:26 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 16:41 +0800, Chen Yu wrote:
> > >
> > IMO this issue could be generic: the cost of C2C is O(sqrt(n)), so in
> > theory, on a system with a large number of CPUs in the LLC and with
> > SMT enabled, the issue is easy to detect.
> >
> > As an example, I did not choose a super big system, but a desktop
> > i9-10980XE, and launched 2 pairs of ping-pong tasks.
> >
> > Each pair of tasks is bound to 1 dedicated core:
> > ./context_switch1_processes -s 30 -t 2
> > average:956883
> >
> > No CPU affinity for the tasks:
> > ./context_switch1_processes -s 30 -t 2 -n
> > average:849209
> >
> > We can see that waking up the wakee on the local core brings benefits on this platform.
>
> Sure, tiny ping-pong balls fly really fast between siblings sharing an
> L2. The players being truly synchronous, there's not a whole lot of
> room for resource competition, and they do about as close to nothing
> but schedule as you can get.
>
> That's not a workload, much less one worthy of special consideration.
> It is a useful tool though, exposed a big socket issue, good job tool.
> Changing task placement strategy in response to that issue makes zero
> sense to me. There are many ways to schedule and wake other tasks at
> high frequency, all use the same paths. Reduce the pain that big box
> sees when playing high speed ping-pong, and real workloads will profit
> in boxen large and small. More in large than small, but nobody loses,
> everybody wins.
>
I'm thinking of two directions based on the current patch:

1. Check the task duration: if the pair is a high speed ping-pong pair,
let the wakee search for an idle SMT sibling on the current core.

This strategy gives the best overall performance improvement, but
tuning the short-task-duration threshold based on the number of online
CPUs would be an obstacle.
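
For reference, a minimal sketch of what direction 1 could look like
(this is not what the patch below implements; p->duration_avg and the
sysctl_sched_migration_cost cutoff are assumptions, borrowed in spirit
from the earlier short-duration-task patches):

/*
 * Hypothetical helper: treat the waker and wakee as a high speed
 * ping-pong pair only if they wake each other up and both have a
 * short average runtime. p->duration_avg is an assumed per-task
 * field; sysctl_sched_migration_cost is only a placeholder cutoff.
 */
static inline bool pingpong_pair(struct task_struct *p)
{
	return current->last_wakee == p && p->last_wakee == current &&
	       p->duration_avg < sysctl_sched_migration_cost &&
	       current->duration_avg < sysctl_sched_migration_cost;
}

/* Then, early in select_idle_sibling(): */
if (sched_smt_active() && pingpong_pair(p)) {
	i = select_idle_smt(p, smp_processor_id());
	if ((unsigned int)i < nr_cpumask_bits)
		return i;
}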

Or

2. Honor the idle core.
That is to say, if there is an idle core in the system, choose that
idle core first. Otherwise, fall back to searching for an idle SMT
sibling rather than choosing an idle CPU on a random half-busy core.

This strategy could partially mitigate the C2C overhead without
breaking the idle-core-first strategy. So I gave it a try; with the
above change (the patch at the end of this mail), I did see some
improvement when the system is around half busy (after all,
has_idle_core has to be false):


Core(TM) i9-10980XE Cascade Lake, 18C/36T
netperf
=======
case          load            baseline(std%)  compare%( std%)
TCP_RR        9-threads        1.00 (  0.50)   +0.48 (  0.39)
TCP_RR        18-threads       1.00 (  0.39)  +10.95 (  0.31)
TCP_RR        27-threads       1.00 (  0.31)   +8.82 (  0.17)
TCP_RR        36-threads       1.00 (  0.86)   +0.02 (  0.28)
TCP_RR        45-threads       1.00 (  4.17)   +0.22 (  4.02)
TCP_RR        54-threads       1.00 (  1.28)   -0.32 (  1.66)
TCP_RR        63-threads       1.00 (  3.21)   +0.28 (  2.76)
TCP_RR        72-threads       1.00 (  0.36)   -0.38 (  0.61)
UDP_RR        9-threads        1.00 (  0.47)   -0.19 (  0.61)
UDP_RR        18-threads       1.00 (  0.12)   +8.30 (  0.12)
UDP_RR        27-threads       1.00 (  3.70)   +9.62 (  0.21)
UDP_RR        36-threads       1.00 (  0.37)   +0.02 (  0.19)
UDP_RR        45-threads       1.00 ( 13.52)   -0.76 ( 17.59)
UDP_RR        54-threads       1.00 ( 11.44)   +0.41 (  1.43)
UDP_RR        63-threads       1.00 ( 17.26)   +0.42 (  3.00)
UDP_RR        72-threads       1.00 (  0.36)   -0.29 (  7.90)

TCP_MAERTS    9-threads        1.00 (  0.11)   +0.04 (  0.06)
TCP_MAERTS    18-threads       1.00 (  0.26)   +1.60 (  0.07)
TCP_MAERTS    27-threads       1.00 (  0.03)   -0.02 (  0.01)
TCP_MAERTS    36-threads       1.00 (  0.01)   -0.01 (  0.01)
TCP_MAERTS    45-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_MAERTS    54-threads       1.00 (  0.00)   +0.00 (  0.01)
TCP_MAERTS    63-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_MAERTS    72-threads       1.00 (  0.01)   +0.00 (  0.00)
TCP_SENDFILE  9-threads        1.00 (  0.10)   +0.34 (  0.25)
TCP_SENDFILE  18-threads       1.00 (  0.01)   -0.02 (  0.03)
TCP_SENDFILE  27-threads       1.00 (  0.02)   -0.01 (  0.01)
TCP_SENDFILE  36-threads       1.00 (  0.00)   -0.00 (  0.00)
TCP_SENDFILE  45-threads       1.00 (  0.00)   +0.00 (  0.00)
TCP_SENDFILE  54-threads       1.00 (  0.00)   +0.00 (  0.00)
TCP_SENDFILE  63-threads       1.00 (  0.00)   +0.00 (  0.00)
TCP_SENDFILE  72-threads       1.00 (  0.00)   -0.00 (  0.00)
TCP_STREAM    9-threads        1.00 (  0.01)   -0.12 (  0.01)
TCP_STREAM    18-threads       1.00 (  0.37)   +1.46 (  0.10)
TCP_STREAM    27-threads       1.00 (  0.01)   -0.01 (  0.02)
TCP_STREAM    36-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM    45-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM    54-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM    63-threads       1.00 (  0.01)   +0.00 (  0.01)
TCP_STREAM    72-threads       1.00 (  0.01)   -0.00 (  0.01)

UDP_STREAM    9-threads        1.00 ( 99.99)   +1.40 ( 99.99)
UDP_STREAM    18-threads       1.00 (101.21)   +7.33 (100.51)
UDP_STREAM    27-threads       1.00 ( 99.98)   +0.03 ( 99.98)
UDP_STREAM    36-threads       1.00 ( 99.97)   +0.16 ( 99.97)
UDP_STREAM    45-threads       1.00 (100.06)   -0.00 (100.06)
UDP_STREAM    54-threads       1.00 (100.02)   +0.10 (100.02)
UDP_STREAM    63-threads       1.00 (100.25)   +0.05 (100.25)
UDP_STREAM    72-threads       1.00 ( 99.89)   -0.03 ( 99.87)

The UDP_STREAM results seem to be quite unstable (std% is around 100%),
so ignore them for now...

Xeon(R) Platinum 8480+ Sapphire Rapids, 56C/112T, 1 socket online
netperf
=======
case          load            baseline(std%)  compare%( std%)
TCP_RR        28-threads       1.00 (  2.19)   -0.32 (  2.26)
TCP_RR        56-threads       1.00 (  4.09)   -0.22 (  3.85)
TCP_RR        84-threads       1.00 (  0.23)  +10.11 (  0.27)
TCP_RR        112-threads      1.00 (  0.16)   +5.12 (  0.19)
TCP_RR        140-threads      1.00 (  9.58)   -0.08 ( 10.26)
TCP_RR        168-threads      1.00 ( 10.96)   -0.61 ( 11.98)
TCP_RR        196-threads      1.00 ( 14.20)   -1.31 ( 14.34)
TCP_RR        224-threads      1.00 (  9.70)   -0.84 (  9.80)
UDP_RR        28-threads       1.00 (  0.99)   -0.29 (  1.06)
UDP_RR        56-threads       1.00 (  6.68)   +0.96 (  7.26)
UDP_RR        84-threads       1.00 ( 15.05)  +10.37 ( 24.45)
UDP_RR        112-threads      1.00 (  8.55)   +3.98 ( 12.17)
UDP_RR        140-threads      1.00 ( 14.55)   +0.25 ( 12.53)
UDP_RR        168-threads      1.00 ( 12.87)   -0.81 ( 18.60)
UDP_RR        196-threads      1.00 ( 16.61)   +0.10 ( 17.12)
UDP_RR        224-threads      1.00 ( 15.81)   +0.00 ( 14.11)

Similarly, there are improvements when the system is above half busy.

[Limitations]
Per the tests, this patch only brings benefits when there is an idle
sibling in the SMT domain and no idle core available. This is a
trade-off: it avoids hurting wakeup latency on low-end platforms while
mitigating the C2C overhead on high-end ones.
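
Since SIS_PAIR is added as a sched feature, it can also be toggled at
runtime for A/B testing, assuming debugfs is mounted at
/sys/kernel/debug:

echo NO_SIS_PAIR > /sys/kernel/debug/sched/features
echo SIS_PAIR > /sys/kernel/debug/sched/features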



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d2613ab392c..572d663065e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7126,6 +7126,23 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
 
+	/*
+	 * If the waker and the wakee are good friends to each other,
+	 * putting them within the same SMT domain could reduce C2C
+	 * overhead. But this only applies when there is no idle core
+	 * available. An idle SMT sibling should be preferred to the
+	 * wakee's previous CPU, because the latter could still have
+	 * the risk of C2C overhead.
+	 *
+	 */
+	if (sched_feat(SIS_PAIR) && sched_smt_active() && !has_idle_core &&
+	    current->last_wakee == p && p->last_wakee == current) {
+		i = select_idle_smt(p, smp_processor_id());
+
+		if ((unsigned int)i < nr_cpumask_bits)
+			return i;
+	}
+
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..86b5c4f16199 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_PAIR, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1