Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS

From: David Vernet
Date: Tue Jul 11 2023 - 17:33:51 EST


On Tue, Jul 11, 2023 at 01:42:07PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> > Difference between shared_runq and SIS_NODE
> > ===========================================
> >
> > In [0] Peter proposed a patch that addresses Tejun's observations that
> > when workqueues are targeted towards a specific LLC on his Zen2 machine
> > with small CCXs, that there would be significant idle time due to
> > select_idle_sibling() not considering anything outside of the current
> > LLC.
> >
> > This patch (SIS_NODE) is essentially the complement to the proposal
> > here. SID_NODE causes waking tasks to look for idle cores in neighboring
> > LLCs on the same die, whereas shared_runq causes cores about to go idle
> > to look for enqueued tasks. That said, in its current form, the two
> > features at are a different scope as SIS_NODE searches for idle cores
> > between LLCs, while shared_runq enqueues tasks within a single LLC.
> >
> > The patch was since removed in [1], and we compared the results to
> > shared_runq (previously called "swqueue") in [2]. SIS_NODE did not
> > outperform shared_runq on any of the benchmarks, so we elect to not
> > compare against it again for this v2 patch set.
>
> Right, so SIS is search-idle-on-wakeup, while you do
> search-task-on-newidle, and they are indeed complentary actions.
>
> As to SIS_NODE, I really want that to happen, but we need a little more
> work for the Epyc things, they have a few too many CCXs per node :-)
>
> Anyway, the same thing that moticated SIS_NODE should also be relevant
> here, those Zen2 things have only 3/4 cores per LLC, would it not also
> make sense to include multiple of them into the shared runqueue thing?

It's probably worth experimenting with this, but it would be workload
dependent on whether this would help or hurt. I would imagine there are
workloads where having a single shared runq for the whole socket is
advantageous even for larger LLCs like on Milan. But for many use cases
(including e.g. HHVM), the cache-line bouncing makes it untenable.

But yes, if we deem SIS_NODE to be useful for small CCXs like Zen2, I
don't see any reason to not apply that to shared_runq as well. I don't
have a Zen2 but I'll prototype this idea and hopefully can get some help
from Tejun or someone else to run some benchmarks on it.

> (My brain is still processing the shared_runq name...)

Figured this would be the most contentious part of v2.