Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: David Vernet
Date: Thu Jun 22 2023 - 10:45:10 EST


On Thu, Jun 22, 2023 at 12:58:41PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 21, 2023 at 03:34:45PM -0500, David Vernet wrote:
> > On Wed, Jun 21, 2023 at 04:20:20PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 13, 2023 at 12:20:04AM -0500, David Vernet wrote:
> > > > +struct swqueue {
> > > > + struct list_head list;
> > > > + spinlock_t lock;
> > > > +} ____cacheline_aligned;
> > >
> > > I'm thinking you can shard this just fine, it makes that pop() needs to
> > > iterate all shards, but that shouldn't be a problem, and it would still
> > > only need to take a single lock.
> >
> > Is the idea here to have multiple sharded queues per LLC, with a single
> > lock protecting them?
>
> No, each shard will have it's own lock, each shard it's own cacheline.
>
> > Or did I misunderstand your suggestion entirely?
>
> Yup, so each shard is basically the above struct, whole cacheline with a
> lock and list. Then on enqueue we hash to a shard (possibly based on cpu
> -- this gives the least amount of bounces) and stick it on.
>
> Then on pop we iterate the shards, the first non-empty shard gets the
> lock taken, and pop'ed. If you start iteration based on the cpu->shard
> map used for enqueue you can even get a sense of 'locality' depending on
> the mapping.

That's nifty. Even if we don't get locality from it, it seems prudent to
do to avoid contention.

> So pop is O(n) in shards, but on average will only take a single lock --

Ah, so it's 1 lock on average -- thanks for clarifying.

I like this idea and will implement it for v2. I'm not keen on the idea
of artificially having multiple independent swqueues (yes, I'll think of
a less evil name) per-LLC to avoid contending, but sharding like this
seems like a simple but effective solution to the scalability issue;
assuming having a sufficiently high N doesn't slow down the pop path by
too much as you pointed out below.

> there is an obvious race where it might see a non-empty queue, but race
> on lock and find the queue empty once acquired, but meh.

Yep, I would be surprised if this race caused a problem in practice.
Especially if we spread the poppers by starting iteration based on the
cpu->shared map as you suggested.

> > > I'm thinking 4 or 8 shards should be plenty, even for Intel LLC.

I'll try out a few combinations and see what works.

> > >
> > > > #ifdef CONFIG_SMP
> > >
> > > > +static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
> > > > +{
> > > > + unsigned long flags;
> > > > +
> > > > + struct task_struct *p;
> > > > +
> > > > + spin_lock_irqsave(&swqueue->lock, flags);
> > > > + p = list_first_entry_or_null(&swqueue->list, struct task_struct,
> > > > + swqueue_node);
> > > > + if (p)
> > > > + list_del_init(&p->swqueue_node);
> > > > + spin_unlock_irqrestore(&swqueue->lock, flags);
> > > > +
> > > > + return p;
> > > > +}
> > >
> > > Would this not normally be called pop() or somesuch?
> >
> > Yes, I'll improve the name in the next iteration. swqueue_dequeue() and
> > swqueue_enqueue() seem like the most canonical. Let me know if you have another
> > preference.
>
> Well, there's two layers intermixed here, which is, I think, what gave
> rise to the whole wonky naming.
>
> If you implement the queue primitives as: push, pop, del
> and then build, using them, the scheduler primitives: enqueue, dequeue,
> pick, things should become saner.

Right, will do

> > > > + if (!task_wakeup || task_migrated || p->nr_cpus_allowed == 1)
> > > > + return;
> > >
> > > Elsewhere you mentioned heuristics, this smells like them. This and the
> > > is_cpus_allowed() thing makes you loose plenty of opportunities.
> >
> > Yeah fair enough, these certainly are heuristics as well.
> >
> > I thought it best to try and avoid swqueue getting in the way of
> > select_task_rq_fair() (at least to start out with), but we could always
> > remove that and run other experiments to see how it does.
>
> So, part of the problem is that you might be hooking at the wrong level
> (see also my earlier email about the queue only containing running tasks
> and none of the ready tasks).
>
> If you were to hook at the __{en,de}queue_entity() level you'll
> automagically dequeue running tasks and enqueue any ready tasks. OTOH
> you get to deal with the problem that not all entities are tasks, but
> that shouldn't be too hard.
>
> Also, you end up with more enqueue/dequeue -- so that might hurt.

If this doesn't cause untenable contention then it seems preferable if
the overall goal is work conservation. Using your shard idea would
hopefully mitigate that concern somewhat as well. I'll try this out for
v2 and compare it to keeping the enqueues to only be on the wakeup path.

If there's too much contention we can always just go back to the wakeup
case, but I'm hopeful this works out.

> Also, at this point you're nr_cpus/N away from simply scanning the
> runqueues and pulling things from non-empty runqueues. But for large
> nr_cpus and smallish N it might be a win over-all. Dunno..

Given that we're not seeing any contention on smaller LLCs but still
observing perf wins, I'm optimistic.

Thanks,
David