Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Gautham R. Shenoy
Date: Thu Jun 22 2023 - 05:22:27 EST


On Wed, Jun 21, 2023 at 08:43:29PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 01:47:00PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
> > On Tue, Jun 20, 2023 at 03:08:22PM -0500, David Vernet wrote:
> > > On Mon, Jun 19, 2023 at 11:43:13AM +0530, Gautham R. Shenoy wrote:
> > > > Hello David,
> > > >
> > > >
> > > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > > [..snip..]
> > > >
> > > > > +static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
> > > > > +{
> > > > > + unsigned long flags;
> > > > > + struct swqueue *swqueue;
> > > > > + bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
> > > > > + bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;
> > > > > +
> > > > > + /*
> > > > > + * Only enqueue the task in the shared wakequeue if:
> > > > > + *
> > > > > + * - SWQUEUE is enabled
> > > > > + * - The task is on the wakeup path
> > > > > + * - The task wasn't purposefully migrated to the current rq by
> > > > > + * select_task_rq()
> > > > > + * - The task isn't pinned to a specific CPU
> > > > > + */
> > > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > > + return;
> > > >
> > > > In select_task_rq_fair(), having determined if the target of task
> > > > wakeup should be the task's previous CPU vs the waker's current CPU,
> > > > we spend quite a bit of time already to determine if there is an idle
> > > > core/CPU in the target's LLC. @rq would correspond to CPU chosen as a
> > > > result of that scan or if no idle CPU exists, @rq corresponds to the
> > > > target CPU determined by wake_affine_idle()/wake_affine_weight().
> > > >
> > > > So if the CPU of @rq is idle here, can we not simply return here?
> > >
> > > Hi Gautum,
> > >
> > > Sorry, I'm not sure I'm quite following the issue you're pointing out.
> > > We don't use swqueue if the task was migrated following
> > > select_task_rq_fair(). That's the idea with us returning if the task was
> > > migrated (the second conditional in that if). If I messed up that logic
> > > somehow, it should be fixed.
> >
> > Sorry, my bad. I see it now.
> >
> > So as per this patch, the only time we enqueue the task on the shared
> > wakeup is if the target of try_to_wake_up() is the same CPU where the
> > task ran previously.
> >
> > When wake_affine logic fails and the previous CPU is chosen as the
> > target, and when there are no other idle cores/threads in the LLC of
> > the previous CPU, it makes sense to queue the task on the
> > shared-wakequeue instead of on a busy previous CPU.
> >
> > And when that previous CPU is idle, the try_to_wake_up() would have
> > woken it up via ttwu_queue(), so before going idle the next time it
> > will check the shared queue for the task and find it. We should be
> > good in this case.
> >
> > Now, it is possible that select_task_rq_fair() ended up selecting the
> > waker's CPU as the target based on the
> > wake_affine_idle()/wake_affine_weight() logic. And if there is no idle
> > core/thread on the waker's LLC, the target would be the busy waker
> > CPU. In the case when the waker CPU is different from the task's
> > previous CPU, due to ENQUEUE_MIGRATE flag being set, the task won't be
> > queued on the shared wakequeue and instead has to wait on the busy
> > waker CPU.
> >
> > I wonder if it makes sense to enqueue the task on the shared wakequeue
> > in this scenario as well.
>
> Hello Gautham,
>
> That's a good point. My original intention with opting out of using
> swqueue if select_task_rq_fair() caused us to migrate is so that it
> wouldn't interfere with decisions made with other select_task_rq_fair()
> heuristics like wake_affine_*(). Basically just minimizing the possible
> impact of swqueue.
> That said, I think it probably does make sense to
> just enqueue in the swqueue regardless of whether ENQUEUE_MIGRATED is
> set. One of the main goals of swqueue is work conservation, and in
> hindsight it does feel somewhat artificial to add a heuristic that works
> against that.

In that case we can perhaps have an explicit flag that is passed by
try_to_wake_up() when it cannot find an idle CPU and the chosen target
is just a fallback. The task gets enqueued on the swqueue of the
target only in such cases. Something like the following entirely
untested:

---------------------------- x8----------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b64fec55a381..38005262a7fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -910,6 +910,12 @@ struct task_struct {
*/
unsigned sched_remote_wakeup:1;

+ /*
+ * Bit used by select_idle_sibling() to signal enqueuing the
+ * task on a shared wakequeue.
+ */
+ unsigned sched_add_to_swq:1;
+
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e311d1c3b992..f4246c33f3c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -215,21 +215,17 @@ static void swqueue_enqueue(struct rq *rq, struct task_struct *p, int enq_flags)
{
unsigned long flags;
struct swqueue *swqueue;
- bool task_migrated = enq_flags & ENQUEUE_MIGRATED;
- bool task_wakeup = enq_flags & ENQUEUE_WAKEUP;

/*
* Only enqueue the task in the shared wakequeue if:
*
* - SWQUEUE is enabled
- * - The task is on the wakeup path
- * - The task wasn't purposefully migrated to the current rq by
- * select_task_rq()
- * - The task isn't pinned to a specific CPU
+ * - select_idle_sibling() didn't find an idle CPU to enqueue this wakee task.
*/
- if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
+ if (!READ_ONCE(p->sched_add_to_swq))
return;

+ WRITE_ONCE(p->sched_add_to_swq, 0);
swqueue = rq_swqueue(rq);
spin_lock_irqsave(&swqueue->lock, flags);
list_add_tail(&p->swqueue_node, &swqueue->list);
@@ -7361,6 +7357,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

+ /*
+ * No idle sibling was found. Ok to queue this task on the
+ * shared wakequeue of the target.
+ */
+ WRITE_ONCE(p->sched_add_to_swq, 1);
return target;
}


>
> I'd like to hear what others think. In my opinion it's worth at least
> running some tests on workloads that heavily utilize the CPU such as
> kernel compile, and seeing what the outcomes are.

I will try and get some numbers for such workloads on our setup over
this weekend.

>
> Thanks,
> David

--
Thanks and Regards
gautham.