Re: [RFC PATCH 3/3] sched: Implement shared wakequeue in CFS

From: David Vernet
Date: Tue Jun 20 2023 - 15:54:31 EST


On Fri, Jun 16, 2023 at 10:08:57AM +0200, Vincent Guittot wrote:
> On Tue, 13 Jun 2023 at 07:20, David Vernet <void@xxxxxxxxxxxxx> wrote:
> >
> > Overview
> > ========
> >
> > The scheduler must constantly strike a balance between work
> > conservation, and avoiding costly migrations which harm performance due
> > to e.g. decreased cache locality. The matter is further complicated by
> > the topology of the system. Migrating a task between cores on the same
> > LLC may be more optimal than keeping a task local to the CPU, whereas
> > migrating a task between LLCs or NUMA nodes may tip the balance in the
> > other direction.
> >
> > With that in mind, while CFS is by and large mostly a work conserving
> > scheduler, there are certain instances where the scheduler will choose
> > to keep a task local to a CPU, when it would have been more optimal to
> > migrate it to an idle core.
> >
> > An example of such a workload is the HHVM / web workload at Meta. HHVM
> > is a VM that JITs Hack and PHP code in service of web requests. Like
> > other JIT / compilation workloads, it tends to be heavily CPU bound, and
> > exhibit generally poor cache locality. To try and address this, we set
> > several debugfs (/sys/kernel/debug/sched) knobs on our HHVM workloads:
> >
> > - migration_cost_ns -> 0
> > - latency_ns -> 20000000
> > - min_granularity_ns -> 10000000
> > - wakeup_granularity_ns -> 12000000
> >
> > These knobs are intended both to encourage the scheduler to be as work
> > conserving as possible (migration_cost_ns -> 0), and also to keep tasks
> > running for relatively long time slices so as to avoid the overhead of
> > context switching (the other knobs). Collectively, these knobs provide a
> > substantial performance win; resulting in roughly a 20% improvement in
> > throughput. Worth noting, however, is that this improvement is _not_ at
> > full machine saturation.
> >
> > That said, even with these knobs, we noticed that CPUs were still going
> > idle even when the host was overcommitted. In response, we wrote the
> > "shared wakequeue" (swqueue) feature proposed in this patch set. The
> > idea behind swqueue is simple: it enables the scheduler to be
> > aggressively work conserving by placing a waking task into a per-LLC
> > FIFO queue which can then be pulled from by another core in the LLC
> > before it goes idle.
>
> This seems to be just another newly idle load balance outside the current one !

Hi Vincent,

I can bring the swqueue logic inside of newidle_balance(). In hindsight
I think it makes more sense there.

To answer your point more generally though, yes, this is a new approach
to load balancing that eschews tracking migration costs, scanning
runqueues, etc. in favor of optimizing for work conservation and
tracking runnable tasks in a shared data structure. More on this below
in response to your other points.

>
> The knobs above are not the only thing preventing a rq to pull a new
> task. We have rq->avg_idle, curr_cost and sd->max_newidle_lb_cost
> stuff which might be one main root cause for one of your cpu not
> pulling a waiting task
>
> It's not clear in your explanation why fixing newly_idle_load_balance
> was not possible instead of adding outside code and what prevents
> newly_idle_load balance from picking a task in your case ?
>
> For example, have you tried to disable the early break because of avg_idle ?

The goal of swqueue is to enable work conservation using a shared, per-LLC data
structure. The shared data structure is really the salient point as to why just
updating newidle_balance() wouldn't achieve the same result. newidle_balance()
finds tasks to migrate by (sometimes) iterating over runqueues and scanning for
tasks. It's an expensive operation, which doesn't scale to large machines or to
being performed on every idle path. swqueue, on the other hand, is shared
across all cores in an LLC, so pulling a runnable task is simply a matter of a
spinlock acquire and a list operation. This doesn't scale to every
configuration, as Aaron pointed out in [0], but it works well in many
others.

[0]: https://lore.kernel.org/all/20230614043529.GA1942@ziqianlu-dell/
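
To give a concrete feel for what I mean by "a spinlock acquire and a
list operation", here's a rough sketch of the shape of it. The names
below, and the list node assumed to be embedded in task_struct, are
illustrative rather than a verbatim copy of what's in the patch set:

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* One queue per LLC, shared by every core in that LLC. */
struct swqueue {
	struct list_head	list;	/* FIFO of runnable tasks */
	spinlock_t		lock;
} ____cacheline_aligned;

/*
 * Wakeup path: O(1) append under the per-LLC lock. p->swqueue_node is
 * an assumed list_head embedded in task_struct for this sketch.
 */
static void swqueue_enqueue(struct swqueue *swqueue, struct task_struct *p)
{
	unsigned long flags;

	spin_lock_irqsave(&swqueue->lock, flags);
	list_add_tail(&p->swqueue_node, &swqueue->list);
	spin_unlock_irqrestore(&swqueue->lock, flags);
}

/*
 * Idle path: pulling a task is just a lock acquire and a list pop; no
 * per-runqueue scanning or migration-cost bookkeeping.
 */
static struct task_struct *swqueue_pull_task(struct swqueue *swqueue)
{
	struct task_struct *p;
	unsigned long flags;

	spin_lock_irqsave(&swqueue->lock, flags);
	p = list_first_entry_or_null(&swqueue->list, struct task_struct,
				     swqueue_node);
	if (p)
		list_del_init(&p->swqueue_node);
	spin_unlock_irqrestore(&swqueue->lock, flags);

	return p;
}

The idle-path side would then amount to a call to something like
swqueue_pull_task() near the top of newidle_balance(), before any of
the domain iteration, which is also why I think moving the logic inside
newidle_balance() makes sense.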

Another consideration is that even if we could adjust newidle_balance()
to load balance well enough for our specific purpose, we'd still be
relying on heuristics to determine when it's appropriate to load
balance, and that will:

a) Inevitably be suboptimal for certain workloads and configurations. For
example, if we got rid of the following check:

	for_each_domain(this_cpu, sd) {
		int continue_balancing = 1;
		u64 domain_cost;

		update_next_balance(sd, &next_balance);

		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;

we may end up load balancing too frequently, which could be a regression in
and of itself per the scalability concerns mentioned above. Or, for certain
workloads, we'd load balance too aggressively and miss out on L1/L2 cache
locality.

b) Be harder for users to effectively tune or use. At OSPM, Peter made it quite
clear that users should not be tuning any of the debugfs knobs. Relying on
heuristics and knobs like this feels antithetical to that policy; swqueue feels
like a reasonable, self-contained alternative.

Thanks,
David