Re: [RFC PATCH] sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

From: Chen Yu
Date: Tue May 16 2023 - 04:41:56 EST


On 2023-05-16 at 08:23:35 +0200, Mike Galbraith wrote:
> On Tue, 2023-05-16 at 09:11 +0800, Chen Yu wrote:
> > [Problem Statement]
> >
> ...
>
> > 20.26%    19.89%  [kernel.kallsyms]          [k] update_cfs_group
> > 13.53%    12.15%  [kernel.kallsyms]          [k] update_load_avg
>
> Yup, that's a serious problem, but...
>
> > [Benchmark]
> >
> > The baseline is on sched/core branch on top of
> > commit a6fcdd8d95f7 ("sched/debug: Correct printing for rq->nr_uninterruptible")
> >
> > Tested will-it-scale context_switch1 case, it shows good improvement
> > both on a server and a desktop:
> >
> > Intel(R) Xeon(R) Platinum 8480+, Sapphire Rapids 2 x 56C/112T = 224 CPUs
> > context_switch1_processes -s 100 -t 112 -n
> > baseline                   SIS_PAIR
> > 1.0                        +68.13%
> >
> > Intel Core(TM) i9-10980XE, Cascade Lake 18C/36T
> > context_switch1_processes -s 100 -t 18 -n
> > baseline                   SIS_PAIR
> > 1.0                        +45.2%
>
> git@homer: ./context_switch1_processes -s 100 -t 8 -n
> (running in an autogroup)
>
> PerfTop: 30853 irqs/sec kernel:96.8% exact: 96.8% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 8 CPUs)
> ------------------------------------------------------------------------------------------------------------
>
> 5.72% [kernel] [k] switch_mm_irqs_off
> 4.23% [kernel] [k] __update_load_avg_se
> 3.76% [kernel] [k] __update_load_avg_cfs_rq
> 3.70% [kernel] [k] __schedule
> 3.65% [kernel] [k] entry_SYSCALL_64
> 3.22% [kernel] [k] enqueue_task_fair
> 2.91% [kernel] [k] update_curr
> 2.67% [kernel] [k] select_task_rq_fair
> 2.60% [kernel] [k] pipe_read
> 2.55% [kernel] [k] __switch_to
> 2.54% [kernel] [k] __calc_delta
> 2.44% [kernel] [k] dequeue_task_fair
> 2.38% [kernel] [k] reweight_entity
> 2.13% [kernel] [k] pipe_write
> 1.96% [kernel] [k] restore_fpregs_from_fpstate
> 1.93% [kernel] [k] select_idle_smt
> 1.77% [kernel] [k] update_load_avg <==
> 1.73% [kernel] [k] native_sched_clock
> 1.66% [kernel] [k] try_to_wake_up
> 1.52% [kernel] [k] _raw_spin_lock_irqsave
> 1.47% [kernel] [k] update_min_vruntime
> 1.42% [kernel] [k] update_cfs_group <==
> 1.36% [kernel] [k] vfs_write
> 1.32% [kernel] [k] prepare_to_wait_event
>
> ...not one with global scope. My little i7-4790 can play ping-pong all
> day long, as can untold numbers of other boxen around the globe.
>
That is true: on smaller systems the C2C overhead is not that high.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..e65028dcd6a6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7125,6 +7125,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >             asym_fits_cpu(task_util, util_min, util_max, target))
> >                 return target;
> >  
> > +       /*
> > +        * If the waker and the wakee are good friends to each other,
> > +        * putting them within the same SMT domain could reduce C2C
> > +        * overhead. SMT idle sibling should be preferred to wakee's
> > +        * previous CPU, because the latter could still have the risk of C2C
> > +        * overhead.
> > +        */
> > +       if (sched_feat(SIS_PAIR) && sched_smt_active() &&
> > +           current->last_wakee == p && p->last_wakee == current) {
> > +               i = select_idle_smt(p, smp_processor_id());
> > +
> > +               if ((unsigned int)i < nr_cpumask_bits)
> > +                       return i;
> > +       }
> > +
> >         /*
> >          * If the previous CPU is cache affine and idle, don't be stupid:
> >          */
>
> Global scope solutions for non-global issues tend to not work out.  
>
> Below is a sample of potential scaling wreckage for boxen that are NOT
> akin to the one you're watching turn caches into silicon based pudding.
>
> Note the *_RR numbers. Those poked me in the eye because they closely
> resemble pipe ping-pong, all fun and games with about as close to zero
> work other than scheduling as network-land can get, but for my box, SMT
> was the third best option of three.
>
> You just can't beat idle core selection when it comes to getting work
> done, which is why SIS evolved to select cores first.
>
There could be some corner cases. Under some conditions, choosing an idle
CPU within the local core might be better than selecting a new idle core. The
tricky part is that SMT is quite special: SMT siblings share the L2, but they
also compete for other critical resources. For short tasks that have a close
relationship with each other, putting them together on the local core (on a
high core count system) could sometimes bring a benefit. The short duration
means that the task pair has less chance to compete for the instruction units
shared by the SMT siblings. But the short-duration threshold depends on the
number of CPUs in the LLC.
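
For reference, the mutual-wakee test the RFC uses to decide that two tasks
are "good friends" can be modelled in userspace roughly as below. This is
only a sketch: struct task and tasks_are_paired() are hypothetical names used
for illustration, while the patch itself open-codes the comparison on
current->last_wakee and p->last_wakee inside select_idle_sibling(). It does
not model task duration at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one task_struct field the check reads. */
struct task {
	const struct task *last_wakee;	/* task most recently woken by this one */
};

/* True only when each task's most recent wakee is the other task. */
static bool tasks_are_paired(const struct task *waker, const struct task *wakee)
{
	assert(waker != NULL && wakee != NULL);
	return waker->last_wakee == wakee && wakee->last_wakee == waker;
}
```

A one-directional relationship (A last woke B, but B last woke someone else)
does not count as a pair here, which is what keeps the check narrow.
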
> Your box and ilk need help that treats the disease and not the symptom,
> or barring that, help that precisely targets boxen having the disease.
>
IMO this issue could be generic: the cost of C2C is O(sqrt(n)), so in theory,
on a system with a large number of LLC and with SMT enabled, the issue is easy
to detect.

As an example, I chose not a super big system but a desktop i9-10980XE, and
launched 2 pairs of ping-pong tasks.

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 2
average:956883

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 2 -n
average:849209

We can see that waking up the wakee on the local core brings a benefit on this
platform.

For comparison, I also ran the same test on my laptop, an i5-8300H with
4 cores/8 CPUs. I did not see any difference when running 2 pairs of
will-it-scale tasks, but I did notice an improvement when the wakees are woken
up on the local core with 4 pairs (I guess this is because the C2C reduction
accumulates):

Each pair of tasks is bound to 1 dedicated core:
./context_switch1_processes -s 30 -t 4
average:731965

No CPU affinity for the tasks:
./context_switch1_processes -s 30 -t 4 -n
average:644337
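
To put the two sets of averages above on the same scale, the relative gain of
the bound case over the unbound case can be computed as below. This is just
the arithmetic, and gain_pct() is a hypothetical helper name, not part of
will-it-scale.

```c
#include <assert.h>

/* Percentage by which the 'bound' average exceeds the 'unbound' average. */
static double gain_pct(double bound, double unbound)
{
	assert(unbound > 0.0);
	return (bound - unbound) / unbound * 100.0;
}
```

With the numbers above, gain_pct(956883, 849209) is about 12.7 for the
i9-10980XE with 2 pairs, and gain_pct(731965, 644337) is about 13.6 for the
i5-8300H with 4 pairs.
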


thanks,
Chenyu

> -Mike
>
> 10 seconds of 1 netperf client/server instance, no knobs twiddled.
>
> TCP_SENDFILE-1 stacked Avg: 65387
> TCP_SENDFILE-1 cross-smt Avg: 65658
> TCP_SENDFILE-1 cross-core Avg: 96318
>
> TCP_STREAM-1 stacked Avg: 44322
> TCP_STREAM-1 cross-smt Avg: 42390
> TCP_STREAM-1 cross-core Avg: 77850
>
> TCP_MAERTS-1 stacked Avg: 36636
> TCP_MAERTS-1 cross-smt Avg: 42333
> TCP_MAERTS-1 cross-core Avg: 74122
>
> UDP_STREAM-1 stacked Avg: 52618
> UDP_STREAM-1 cross-smt Avg: 55298
> UDP_STREAM-1 cross-core Avg: 97415
>
> TCP_RR-1 stacked Avg: 242606
> TCP_RR-1 cross-smt Avg: 140863
> TCP_RR-1 cross-core Avg: 219400
>
> UDP_RR-1 stacked Avg: 282253
> UDP_RR-1 cross-smt Avg: 202062
> UDP_RR-1 cross-core Avg: 288620