Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks within one LLC

From: Vincent Guittot
Date: Wed Apr 28 2021 - 09:04:35 EST


On Wed, 28 Apr 2021 at 11:51, Song Bao Hua (Barry Song)
<song.bao.hua@xxxxxxxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Dietmar Eggemann [mailto:dietmar.eggemann@xxxxxxx]
> > Sent: Tuesday, April 27, 2021 11:36 PM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>;
> > tim.c.chen@xxxxxxxxxxxxxxx; catalin.marinas@xxxxxxx; will@xxxxxxxxxx;
> > rjw@xxxxxxxxxxxxx; vincent.guittot@xxxxxxxxxx; bp@xxxxxxxxx;
> > tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; lenb@xxxxxxxxxx; peterz@xxxxxxxxxxxxx;
> > rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx; mgorman@xxxxxxx
> > Cc: msys.mizuma@xxxxxxxxx; valentin.schneider@xxxxxxx;
> > gregkh@xxxxxxxxxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>;
> > juri.lelli@xxxxxxxxxx; mark.rutland@xxxxxxx; sudeep.holla@xxxxxxx;
> > aubrey.li@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx;
> > linux-kernel@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; x86@xxxxxxxxxx;
> > xuwei (O) <xuwei5@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>;
> > guodong.xu@xxxxxxxxxx; yangyicong <yangyicong@xxxxxxxxxx>; Liguozhu (Kenneth)
> > <liguozhu@xxxxxxxxxxxxx>; linuxarm@xxxxxxxxxxxxx; hpa@xxxxxxxxx
> > Subject: Re: [RFC PATCH v6 3/4] scheduler: scan idle cpu in cluster for tasks
> > within one LLC
> >
> > On 20/04/2021 02:18, Barry Song wrote:
> >
> > [...]
> >
> > > @@ -5786,11 +5786,12 @@ static void record_wakee(struct task_struct *p)
> > > * whatever is irrelevant, spread criteria is apparent partner count exceeds
> > > * socket size.
> > > */
> > > -static int wake_wide(struct task_struct *p)
> > > +static int wake_wide(struct task_struct *p, int cluster)
> > > {
> > > unsigned int master = current->wakee_flips;
> > > unsigned int slave = p->wakee_flips;
> > > - int factor = __this_cpu_read(sd_llc_size);
> > > + int factor = cluster ? __this_cpu_read(sd_cluster_size) :
> > > + __this_cpu_read(sd_llc_size);
> >
> > I don't see that the wake_wide() change has any effect here. None of the
> > sched domains has SD_BALANCE_WAKE set so a wakeup (WF_TTWU) can never
> > end up in the slow path.
>
> I am really confused. The whole code has only checked if wake_flags
> has WF_TTWU, it has never checked if sd_domain has SD_BALANCE_WAKE flag.

look at :
#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */

so when wake_wide return false, we use the wake_affine mecanism but
if it's false then we fllback to default mode which looks for:
if (tmp->flags & sd_flag)

This means looking for SD_BALANCE_WAKE which is never set

so sd will stay NULL and you will end up calling select_idle_sibling anyway

>
> static int
> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> {
> ...
>
> if (wake_flags & WF_TTWU) {
> record_wakee(p);
>
> if (sched_energy_enabled()) {
> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> if (new_cpu >= 0)
> return new_cpu;
> new_cpu = prev_cpu;
> }
>
> want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> }
> }
>
> And try_to_wake_up() has always set WF_TTWU:
> static int
> try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> {
> cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
> ...
> }
>
> So the change in wake_wide will actually affect the value of want_affine.
> And I did also see code entered slow path during my benchmark.
>
> One issue I mentioned during linaro open discussion is that
> since I have moved to use cluster size to decide the value
> of wake_wide, relatively less tasks will make wake_wide()
> decide to go to slow path, thus, tasks begin to spread to
> other NUMA, but actually llc_size might be able to contain
> those tasks. So a possible model might be:
> static int wake_wide(struct task_struct *p)
> {
> tasksize < cluster : scan cluster
> tasksize > llc : slow path
> tasksize > cluster && tasksize < llc: scan llc
> }
>
> thoughts?
>
> > Have you seen a diff when running your `lmbench stream` workload in what
> > wake_wide() returns when you use `sd cluster size` instead of `sd llc
> > size` as factor?
> >
> > I guess for you, wakeups are now subdivided into faster (cluster = 4
> > CPUs) and fast (llc = 24 CPUs) via sis(), not into fast (sis()) and slow
> > (find_idlest_cpu()).
> >
> > >
> > > if (master < slave)
> > > swap(master, slave);
> >
> > [...]
> >
> > > @@ -6745,6 +6748,12 @@ static int find_energy_efficient_cpu(struct
> > task_struct *p, int prev_cpu)
> > > int want_affine = 0;
> > > /* SD_flags and WF_flags share the first nibble */
> > > int sd_flag = wake_flags & 0xF;
> > > + /*
> > > + * if cpu and prev_cpu share LLC, consider cluster sibling rather
> > > + * than llc. this is typically true while tasks are bound within
> > > + * one numa
> > > + */
> > > + int cluster = sched_cluster_active() && cpus_share_cache(cpu, prev_cpu, 0);
> >
> > So you changed from scanning cluster before LLC to scan either cluster
> > or LLC.
>
> Yes, I have seen two ugly things for scanning cluster before scanning LLC
> in select_idle_cpu.
> 1. avg_scan_cost is actually measuring the scanning time. But if we scan
> cluster before scanning LLC, during the gap of these two different
> domains, we need a huge bit operation and this bit operation is not
> a scanning operation at all. This makes the scan_cost quite
> untrustworthy particularly "nr" can sometimes be < cluster size, sometimes
> > cluster size.
>
> 2. select_idle_cpu() is actually the last step of wake_affine, before
> that, wake_affine code has been totally depending on cpus_share_cache()
> to decide the target to scan from. When waker and wakee have been already
> in one LLC, if we only change select_idle_cpu(), at that time, decision
> has been made. we may lose some chance to choose the right target to scan
> from. So it should be more sensible to let cpus_share_cache() check cluster
> when related tasks have been in one same LLC.
>
> >
> > And this is based on whether `this_cpu` and `prev_cpu` are sharing LLC
> > or not. So you only see an effect when running the workload with
> > `numactl -N X ...`.
>
> Ideally, I'd expect this can also positively affect tasks located in
> different LLCs.
> For example, if taskA and taskB are in different NUMAs(also LLC for both
> kunpeng920 and Tim's hardware) at the first beginning, a two-stage packing
> might make them take the advantage of cluster:
> For the first wake-up, taskA and taskB will be put in one LLC by scanning
> LLC;
> For the second wake-up, they might be put in one cluster by scanning
> cluster.
> So ideally, scanning LLC and scanning cluster can work in two stages
> for related tasks and pack them step by step. Practically, this
> could happen. But LB between NUMA might take the opposite way. Actually,
> for a kernel completely *without* cluster patch, I have seen some
> serious ping-pong of tasks in two numa nodes due to the conflict
> of wake_affine and LB. this kind of ping-pong could seriously affect
> the performance.
> For example, for g=6,12,18,24,28,32, I have found running same workload
> on 2numa shows so much worse latency than doing that on single one
> numa(each numa has 24 cpus).
> 1numa command: numactl -N 0 hackbench -p -T -l 1000000 -f 1 -g $1
> 2numa command: numactl -N 0-1 hackbench -p -T -l 1000000 -f 1 -g $1
>
> Measured the average latency of 20 times for each command.
>
> *)without cluster scheduler, 2numa vs 1numa:
> g = 6 12 18 24 28 32
> 1numa 1.2474 1.5635 1.5133 1.4796 1.6177 1.7898
> 2numa 4.1997 5.8172 6.0159 7.2343 6.8264 6.5419
>
> BTW, my cluster patch actually also improves 2numa:
> *)with cluster scheduler 2numa vs 1numa:
> g = 6 12 18 24 28 32
> 1numa 0.9500 1.0728 1.1756 1.2201 1.4166 1.5464
> 2numa 3.5404 4.3744 4.3988 4.6994 5.3117 5.4908
>
> *) 2numa w/ and w/o cluster:
> g = 6 12 18 24 28 32
> 2numa w/o 4.1997 5.8172 6.0159 7.2343 6.8264 6.5419
> 2numa w/ 3.5404 4.3744 4.3988 4.6994 5.3117 5.4908
>
> Ideally, load balance should try to pull unmarried tasks rather than
> married tasks. I mean, if we have
> groupA: task1+task2 as couple, task3 as bachelor
> groupB: task4.
> groupB should try to pull task3. But I feel it is extremely hard to let
> LB understand who is married and who is unmarried.
>
> I assume 2numa worse than 1numa should be a different topic
> which might be worth more investigation.
>
> On the other hand, use cases I'm now targeting at are really using
> "numactl -N x" to bind processes in one NUMA. If we ignore other NUMA
> (also other LLCs) and think one NUMA as a whole system, cluster would
> be the last-level topology scheduler can use. And the code could be
> quite clean to directly leverage the existing select_sibling code for
> LLC by simply changing cpus_share_cache() to cluster level.
>
> >
> > >
> > > if (wake_flags & WF_TTWU) {
> > > record_wakee(p);
> > > @@ -6756,7 +6765,7 @@ static int find_energy_efficient_cpu(struct task_struct
> > *p, int prev_cpu)
> > > new_cpu = prev_cpu;
> > > }
> > >
> > > - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > > + want_affine = !wake_wide(p, cluster) && cpumask_test_cpu(cpu,
> > p->cpus_ptr);
> > > }
> > >
> > > rcu_read_lock();
> > > @@ -6768,7 +6777,7 @@ static int find_energy_efficient_cpu(struct task_struct
> > *p, int prev_cpu)
> > > if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> > > cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> > > if (cpu != prev_cpu)
> > > - new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
> > > + new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync, cluster);
> > >
> > > sd = NULL; /* Prefer wake_affine over balance flags */
> > > break;
> > > @@ -6785,7 +6794,7 @@ static int find_energy_efficient_cpu(struct task_struct
> > *p, int prev_cpu)
> > > new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
> > > } else if (wake_flags & WF_TTWU) { /* XXX always ? */
> > > /* Fast path */
> > > - new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > > + new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, cluster);
> > >
> > > if (want_affine)
> > > current->recent_used_cpu = cpu;
> >
> > [...]
>
> Thanks
> Barry
>