Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Peter Zijlstra
Date: Thu Jun 22 2023 - 06:59:11 EST


On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote:
> On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > +struct swqueue {
> > > + struct list_head list;
> > > + spinlock_t lock;
> > > +} ____cacheline_aligned;
> >
> > I'm thinking you can shard this just fine, it makes that pop() needs to
> > iterate all shards, but that shouldn't be a problem, and it would still
> > only need to take a single lock.
>
> Is the idea here to have multiple sharded queues per LLC, with a single
> lock protecting them?

No, each shard will have it's own lock, each shard it's own cacheline.

> Or did I misunderstand your suggestion entirely?

Yup, so each shard is basically the above struct, whole cacheline with a
lock and list. Then on enqueue we hash to a shard (possibly based on cpu
-- this gives the least amount of bounces) and stick it on.

Then on pop we iterate the shards, the first non-empty shard gets the
lock taken, and pop'ed. If you start iteration based on the cpu->shard
map used for enqueue you can even get a sense of 'locality' depending on
the mapping.

So pop is O(n) in shards, but on average will only take a single lock --
there is an obvious race where it might see a non-empty queue, but race
on lock and find the queue empty once acquired, but meh.

> > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.
> >
> > > #ifdef CONFIG_SMP
> >
> > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > > +{
> > > + unsigned long flags;
> > > +
> > > + struct task_struct *p;
> > > +
> > > + spin_lock_irqsave(&swqueue->lock, flags);
> > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > > + swqueue_node);
> > > + if (p)
> > > + list_del_init(&p->swqueue_node);
> > > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > > +
> > > + return p;
> > > +}
> >
> > Would this not normally be called pop() or somesuch?
>
> Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
> swqueue_enqueue() seem like the most canonical. Let me know if you have another
> preference.

Well, there's two layers intermixed here, which is, I think, what gave
rise to the whole wonky naming.

If you implement the queue primitives as: push, pop, del
and then build, using them, the scheduler primitives: enqueue, dequeue,
pick, things should become saner.

> > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > + return;
> >
> > Elsewhere you mentioned heuristics, this smells like them. This and the
> > is_cpus_allowed() thing makes you loose plenty of opportunities.
>
> Yeah fair enough, these certainly are heuristics as well.
>
> I thought it best to try and avoid swqueue getting in the way of
> select_task_rq_fair() (at least to start out with), but we could always
> remove that and run other experiments to see how it does.

So, part of the problem is that you might be hooking at the wrong level
(see also my earlier email about the queue only containing running tasks
and none of the ready tasks).

If you were to hook at the __{en,de}queue_entity() level you'll
automagically dequeue running tasks and enqueue any ready tasks. OTOH
you get to deal with the problem that not all entities are tasks, but
that shouldn't be too hard.

Also, you end up with more enqueue/dequeue -- so that might hurt.

Also, at this point you're nr_cpus/N away from simply scanning the
runqueues and pulling things from non-empty runqueues. But for large
nr_cpus and smallish N it might be a win over-all. Dunno..