Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: Roman Gushchin
Date: Tue Jun 20 2023 - 17:38:17 EST


On Tue, Jun 20, 2023 at 02:54:23PM -0500, David Vernet wrote:
> On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote:
> > On Tue, 13 Jun 2023 at 07:20, David Vernet <void@xxxxxxxxxxxxx> wrote:
> > >
> > > Overview
> > > ========
> > >
> > > The scheduler must constantly strike a balance between work
> > > conservation, and avoiding costly migrations which harm performance due
> > > to e.g. decreased cache locality. The matter is further complicated by
> > > the topology of the system. Migrating a task between cores on the same
> > > LLC may be more optimal than keeping a task local to the CPU, whereas
> > > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > > other direction.
> > >
> > > With that in mind, while CFS is by and large a work conserving
> > > scheduler, there are certain instances where the scheduler will choose
> > > to keep a task local to a CPU, when it would have been more optimal to
> > > migrate it to an idle core.
> > >
> > > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > > is a VM that JITs Hack and PHP code in service of web requests. Like
> > > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > > exhibit generally poor cache locality. To try and address this, we set
> > > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> > >
> > > - migration_cost_ns -> 0
> > > - latency_ns -> 20000000
> > > - min_granularity_ns -> 10000000
> > > - wakeup_granularity_ns -> 12000000
> > >
> > > These knobs are intended both to encourage the scheduler to be as work
> > > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > > running for relatively long time slices so as to avoid the overhead of
> > > context switching (the other knobs). Collectively, these knobs provide a
> > > substantial performance win, resulting in roughly a 20% improvement in
> > > throughput. Worth noting, however, is that this improvement is _not_ at
> > > full machine saturation.
> > >
> > > That said, even with these knobs, we noticed that CPUs were still going
> > > idle even when the host was overcommitted. In response, we wrote the
> > > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > > idea behind swqueue is simple: it enables the scheduler to be
> > > aggressively work conserving by placing a waking task into a per-LLC
> > > FIFO queue that can be pulled from by another core in the LLC before
> > > it goes idle.
> >
> > This seems to be just another newly idle load balance outside the current one!
>
> Hi Vincent,
>
> I can bring the swqueue logic inside of newidle_balance(). In hindsight
> I think it makes more sense there.
>
> To answer your point more generally though, yes, this is a new approach
> to load balancing that eschews tracking migration costs, scanning
> runqueues, etc in favor of optimizing for work conservation, and
> tracking runnable tasks in a shared data structure. More on this below
> in response to your other points.
>
> >
> > The knobs above are not the only thing preventing a rq from pulling a new
> > task. We also have the rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
> > checks, which might be the main root cause of one of your CPUs not
> > pulling a waiting task.
> >
> > It's not clear from your explanation why fixing newly idle load balance
> > was not possible instead of adding code outside it, and what prevents
> > newly idle load balance from picking a task in your case?
> >
> > For example, have you tried to disable the early break because of avg_idle?
>
> The goal of swqueue is to enable work conservation using a shared, per-LLC data
> structure. The shared data structure is really the salient point as to why just
> updating newidle_balance() wouldn't achieve the same result. newidle_balance()
> finds tasks to migrate by (sometimes) iterating over runqueues and scanning for
> tasks. It's an expensive operation, which doesn't scale to large machines or to
> being performed on every idle path. swqueue, on the other hand, is shared
> across all cores in an LLC, so pulling a runnable task is simply a matter of a
> spinlock acquire and a list operation. This doesn't scale to every single
> configuration as Aaron pointed out in [0], but it works well in many other
> configurations.

+1
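
For readers who haven't looked at the patch itself, the shared structure
David describes boils down to roughly the following. This is a simplified
sketch; the names and the way the per-task node would hook into task_struct
are illustrative, not necessarily what the actual patch does (initialization
is omitted):

#include <linux/spinlock.h>
#include <linux/list.h>

/*
 * One shared queue per LLC; every CPU in the LLC points at the same
 * instance, so enqueueing or pulling a task is a single lock acquire
 * plus a list operation.
 */
struct swqueue {
        raw_spinlock_t   lock;
        struct list_head list;   /* FIFO of runnable tasks */
};

/* Per-task hook; assumed to be embedded in task_struct for this sketch. */
struct swqueue_node {
        struct list_head node;   /* empty when the task is not queued */
};

/* Wakeup path: append the waking task to its LLC's shared queue. */
static void swqueue_enqueue(struct swqueue *swq, struct swqueue_node *swn)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&swq->lock, flags);
        list_add_tail(&swn->node, &swq->list);
        raw_spin_unlock_irqrestore(&swq->lock, flags);
}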

Additionally, swqueue minimizes the number of races when two waking tasks
pick the same cpu. It also serializes the wakeup order, which effectively
pessimizes tasks that are woken very often and favors cpu-bound tasks; this
usually has a positive effect on throughput.
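
The pull side is where the FIFO ordering and the serialization come from:
a CPU about to go idle pops the oldest queued task under the same lock
instead of scanning other runqueues. Continuing the sketch above (same
caveats about naming):

/*
 * Idle path (e.g. from newidle_balance()): pop the task that has been
 * waiting the longest, if any. The caller would then migrate the
 * corresponding task onto this CPU.
 */
static struct swqueue_node *swqueue_pull(struct swqueue *swq)
{
        struct swqueue_node *swn;
        unsigned long flags;

        raw_spin_lock_irqsave(&swq->lock, flags);
        swn = list_first_entry_or_null(&swq->list, struct swqueue_node, node);
        if (swn)
                list_del_init(&swn->node);
        raw_spin_unlock_irqrestore(&swq->lock, flags);

        return swn;
}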

I agree that swqueue can be seen as an alternative form of idle load
balancing; I thought of it this way when I was working on it. Can it
replace idle load balancing completely? I don't know, maybe. But perhaps
we need some sort of balancing between multiple swqueues: e.g. if there
are two swqueues on the system, one empty and the other long, idle load
balancing could borrow some tasks from the "foreign" swqueue. Just some
ideas.
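
To make that last idea slightly more concrete, the borrowing could look
something like the sketch below: a CPU whose local swqueue is empty peeks
at the other LLCs' queues before giving up and going idle. This is purely
an idea sketch, not something the patch set implements, and
for_each_swqueue() is a hypothetical iterator:

/*
 * Idea sketch only: if the local LLC's queue is empty, try to borrow
 * the oldest task from another LLC's queue before going idle.
 * swqueue_pull() is the helper from the sketch above.
 */
static struct swqueue_node *swqueue_pull_any(struct swqueue *local)
{
        struct swqueue_node *swn = swqueue_pull(local);
        struct swqueue *swq;

        if (swn)
                return swn;

        for_each_swqueue(swq) {         /* hypothetical iterator over all LLCs */
                if (swq == local)
                        continue;
                swn = swqueue_pull(swq);
                if (swn)
                        return swn;     /* cross-LLC migration cost applies here */
        }

        return NULL;
}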

Thanks!