Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Gautham R. Shenoy
Date: Fri Jun 23 2023 - 05:51:14 EST


On Thu, Jun 22, 2023 at 12:29:35PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 22, 2023 at 02:41:57PM +0530, Gautham R. Shenoy wrote:
>
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > {
> > unsigned long flags;
> > struct swqueue *swqueue;
> > - bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > - bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> >
> > /*
> > * Only enqueue the task in the shared wakequeue if:
> > *
> > * - SWQUEUE is enabled
> > - * - The task is on the wakeup path
> > - * - The task wasn't purposefully migrated to the current rq by
> > - * select_task_rq()
> > - * - The task isn't pinned to a specific CPU
> > + * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
> > */
> > - if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
>
> You lost the nr_cpus_allowed == 1 case, that actually made sense, except
> he didn't have a hook in set_cpus_allowed() to fix it back up.
>

I didn't retain the "nr_cpus_allowed == 1" check because
select_task_rq() calls select_task_rq_fair() only when
p->nr_cpus_allowed > 1.

So if p was affined to a single CPU, it would always have
p->sched_add_to_swq set to 0, and hence would never be queued on the 'queue'.
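
For reference, the relevant check in select_task_rq() looks roughly
like this (paraphrased and trimmed from kernel/sched/core.c; only the
nr_cpus_allowed > 1 gate matters for this argument):

static inline
int select_task_rq(struct task_struct *p, int cpu, int wake_flags)
{
        /*
         * select_task_rq_fair() -- and hence the code that sets
         * p->sched_add_to_swq -- is only reached when the task is
         * allowed on more than one CPU.
         */
        if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p))
                cpu = p->sched_class->select_task_rq(p, cpu, wake_flags);
        else
                cpu = cpumask_any(p->cpus_ptr);

        return cpu;
}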


> > + if (!READ_ONCE(p->sched_add_to_swq))
> > return;
>
> So suppose we fill all CPUs with tasks T_{0..n-1}, sis finds them a
> nice idle cpu and we don't get queued.

Yes.

>
> Then wake up another n tasks T_{n...2n-1}, none of them find an idle
> CPU, as per the above they're all taken. These all do get queued.

Yes.

>
> However, suppose T>=n does a wakeup-preemption, and we end up with T>=n
> running while T<n get queued on the rq, but not the 'queue' (I so hate
> the swqueue name).

Yes. But note that prior to running T>=n on the CPU, it will be
removed from the 'queue'. So if every T>=n preempts the corresponding
T<n, then the queue becomes empty. And yes, in this case the queue
served no purpose and does not justify the additional overhead.
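
Just to illustrate what that overhead amounts to: the removal before
the task runs is something along these lines (the helper and field
names here are my approximation, not the patch's literal code):

/*
 * Sketch only: take the task back off the shared list under the
 * per-queue lock.  A task that wins wakeup-preemption pays for the
 * earlier enqueue plus this removal without anyone ever pulling it.
 */
static void swqueue_remove_task(struct task_struct *p)
{
        unsigned long flags;
        struct swqueue *swqueue;

        if (!list_empty(&p->swqueue_node)) {
                swqueue = rq_swqueue(task_rq(p));
                spin_lock_irqsave(&swqueue->lock, flags);
                list_del_init(&p->swqueue_node);
                spin_unlock_irqrestore(&swqueue->lock, flags);
        }
}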

>
> Then one CPU has both its tasks go 'away' and becomes idle.

Yes, and this CPU prior to newidle_balance() will now search the
'queue'.
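
To sketch the ordering I mean (this is my paraphrase of where the
hook sits, not a quote of the patch):

        /*
         * The CPU about to go idle first tries to pull a queued
         * wakee from the shared wakequeue, and only falls back to
         * the (much more expensive) newidle_balance() pass if that
         * pull fails.
         */
        if (!swqueue_pick_next_task(rq, rf))
                newidle_balance(rq, rf);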

>
> At this point the 'queue' contains only running tasks that are not
> migratable and all the tasks that could be migrated are not on the queue
> -- IOW it is completely useless.

At this point, the CPU will call swqueue_pick_next_task() and we can
have one of three cases (a rough sketch of this path follows the list):

* The 'queue' is empty: as mentioned above, the queue didn't do
  anything for the tasks and was completely useless.

* The task at the head of the 'queue' is not runnable on the CPU
  which is going idle. Again, adding this task to the queue was
  wasted effort, because in the current implementation the task is
  removed from the 'queue' even when it is not runnable on this
  CPU.

* The 'queue' has some task which can run on the CPU going idle.
  The task is successfully migrated to this CPU, which no longer
  has to do newidle_balance(), and the task's waiting period is
  reduced.
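
A rough sketch of that pick path as I understand it (names and
details are approximations, not the patch verbatim); note that the
head task is dequeued before the affinity check, which is what makes
the second case above pure wasted work:

static struct task_struct *swqueue_pick_next_task(struct rq *rq)
{
        struct swqueue *swqueue = rq_swqueue(rq);
        struct task_struct *p;
        unsigned long flags;

        spin_lock_irqsave(&swqueue->lock, flags);
        p = list_first_entry_or_null(&swqueue->list, struct task_struct,
                                     swqueue_node);
        if (p)
                /* Removed even if p cannot run on this CPU. */
                list_del_init(&p->swqueue_node);
        spin_unlock_irqrestore(&swqueue->lock, flags);

        if (!p)
                return NULL;    /* case 1: empty, the queue bought us nothing */

        if (!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
                return NULL;    /* case 2: head task can't run here; wasted effort */

        return p;               /* case 3: migrate p to this newidle CPU */
}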

>
> Is this the intention?

IIUC, the intention is to reduce the average post-wakeup wait time of
the task by running it on a newidle CPU while the task's own CPU is
still busy running something else. This assumes that the task won't
win wakeup-preemption. The flipside is the additional cost of swqueue
enqueue/dequeue.

I have queued a run across a set of benchmarks, but just to get an
idea of how the patch performs, I ran hackbench. These are the results
on a 2-socket Gen3 EPYC configured in NPS=2, with 64 cores / 128
threads per socket (hackbench time in seconds; lower is better):

Test:          tip               tip_swq_disabled    tip_swq_enabled
=============================================================================
 1-groups:     3.82 (0.00 pct)   3.84 (-0.52 pct)    3.46 (+9.42 pct)
 2-groups:     4.40 (0.00 pct)   4.42 (-0.45 pct)    3.66 (+16.81 pct)
 4-groups:     4.84 (0.00 pct)   4.82 (+0.41 pct)    4.50 (+7.02 pct)
 8-groups:     5.45 (0.00 pct)   5.43 (+0.36 pct)    5.85 (-7.33 pct)
16-groups:     6.94 (0.00 pct)   6.90 (+0.57 pct)    8.41 (-21.18 pct)

We can see that the patchset does quite well with 1, 2 and 4 groups,
but with 8 and 16 groups, as the system becomes overloaded, the
performance degrades. In the overloaded case, I could observe
native_queued_spin_lock_slowpath() looming large in the perf profiles
when the swqueue feature was enabled. So perhaps the cost of swqueue
enqueue/dequeue is not justifiable, especially if the tasks are anyway
going to run on their target CPUs.

I will post more results later.
--
Thanks and Regards
gautham.