Re: [PATCHSET v4] sched: Implement BPF extensible scheduler class

From: Tejun Heo
Date: Thu Jul 27 2023 - 20:12:14 EST


Hello, Peter.

On Wed, Jul 26, 2023 at 11:17:52AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 21, 2023 at 08:37:41AM -1000, Tejun Heo wrote:
> > We are comfortable with the current API. Everything we tried fit pretty
> > well. It will continue to evolve but sched_ext now seems mature enough for
> > initial inclusion. I suppose lack of response doesn't indicate tacit
> > agreement from everyone, so what are you guys all thinking?
>
> I'm still hating the whole thing with a passion.
>
> As can be seen from the wide-spread SCHED_DEBUG abuse; people are, in
> general, not interested in doing the right thing. They prod random
> numbers (as in really, some are just completely insane) until their
> workload improves and call it a day.

I think it'd be useful to add some detail about what's going on in
situations like the above. This of course won't apply directly to everyone,
but I suspect many will recognize at least some parts of it.

In many production setups, there are aspects of workload behaviors that are
difficult to understand comprehensively. The workloads are often massively
complex, constantly being developed by many people, and dynamically
interacting with external entities. As with any sufficiently complex system,
there are many emergent properties which are difficult to untangle
completely.

Add to that multiple generations of divergent hardware and the fact that
most of the software stack comes from third parties (including the kernel,
from an application team's POV), and people often, justifiably, feel as if
they're swimming in a sea of black boxes and emergent properties.

Scheduling, naturally, is one of the areas people look into when trying to
optimize system performance. The vast majority of people don't know the
scheduler code base well enough to hack on it. Even when they do, it's
often not easy to set up benchmarks in production environments and cycle
through different kernels. We (Meta) are a lot better at this than a couple
of years ago, but even now, swapping kernels and ramping workloads back up
can take a long time for certain workloads.

Given the circumstances, it's not surprising that people reach for tunable
knobs when trying to find out whether changing scheduling behavior would
improve performance for their workloads. That's often the only option
available, and tuning the knobs frequently leads to some gains. Most people
aren't scheduling experts, and the causal relationship between changes and
results may not be direct or intuitive, so that's often where things end.
Given that nobody has found scheduling behavior which is optimal for every
workload, and that the SCHED_DEBUG knobs are what people can access, it's
an expected outcome.
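
To make "prodding the knobs" concrete, a typical trial amounts to something
like the sketch below. It assumes a recent kernel where the knobs live
under /sys/kernel/debug/sched/ (they moved there from sysctls around
v5.13); the exact knob set varies by version:

/*
 * Illustrative only: nudge a couple of SCHED_DEBUG knobs and watch what
 * the workload does.  Assumes the debugfs paths used by recent kernels;
 * needs root, and the available knobs differ across versions.
 */
#include <stdio.h>

static void write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* make wakeup preemption less aggressive */
	write_knob("/sys/kernel/debug/sched/wakeup_granularity_ns", "15000000");
	/* make migrations look more expensive */
	write_knob("/sys/kernel/debug/sched/migration_cost_ns", "5000000");
	return 0;
}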

If a consistent pattern repeats across multiple workloads, we can sometimes
work backwards to why tuning a certain way makes sense and generalize from
that, which is, to some degree, how we ended up focusing on the recent
work-conservation related projects.

Maybe the situation is not ideal, but I don't think it's people not being
interested in doing the right thing. They are doing what they can within
the confines of the mechanisms available to them, their expertise, and the
time and effort they can afford to invest.

One of the impediments to connecting these disparate data points into
something meaningful is the difficulty of experimentation. Trials are
confined to whatever combinations can be achieved with the SCHED_DEBUG
knobs, which is both limiting and obscuring. I believe we're a lot more
likely to learn about scheduling with sched_ext widely available than
without, as it allows easier and wider-in-scope experimentation.
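
To give a sense of what such an experiment looks like, below is a rough
sketch of a minimal global-FIFO policy in the spirit of the example
schedulers in this series. The helper and flag names (scx_bpf_dispatch(),
SCX_DSQ_GLOBAL, SCX_SLICE_DFL) follow this patchset and may well change as
it evolves; the point is that swapping policies means loading a different
BPF program rather than building and booting a new kernel.

/*
 * Rough sketch of a minimal global-FIFO sched_ext scheduler.  Helper and
 * flag names track this patchset and may change in later revisions.
 */
#include "scx_common.bpf.h"

char _license[] SEC("license") = "GPL";

/* Put every runnable task on the shared global dispatch queue with the
 * default slice; CPUs then pull from it in FIFO order. */
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops")
struct sched_ext_ops simple_ops = {
	.enqueue	= (void *)simple_enqueue,
	.name		= "simple",
};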

> There is not a single doubt in my mind that if I were to merge this,
> there will be Enterprise software out there that will mandate its own
> BPF sched thing, or else it won't work.
>
> They will not care, they will not contribute, they might even pull a
> RedHat and only share the code to customers.

I'm sure some will behave in a way which isn't the most conducive to
collective improvement of the upstream kernel. That said, I don't see how
this will be noticeably worsened by inclusion of sched_ext. Most mobile
kernels and some production kernels in cloud environments already carry
significant custom modifications, and they're often addressing real problems
for their use cases.

It'd be ideal if everyone had the commitment and bandwidth to try their
best to merge their changes back, but it's also understandable why that
can't always be the case. Sometimes the work is too specific or
underdeveloped. At other times, the time and resources just aren't there.
We can incentivize and coerce, but that can be pushed only so far. However,
we do have a much easier time learning about what people are doing thanks
to the GPL, which all sched_ext programs would need to follow, exactly like
the rest of the kernel.

At least relatively speaking, scheduling doesn't seem like an area which is
particularly starved for developer bandwidth, although one can always hope
for more. Actual insights, and an easy way to experiment and collaborate to
discover them, seem like the bigger bottleneck. Hopefully, sched_ext will
widen the scope of things that people try. Even when they don't directly
contribute those changes back to CFS, if a strategy is effective and
general enough, others can learn from it and apply it to improve scheduling
for everyone.

Both Meta and Google are committed to sharing what we learn, both in terms
of code and insights. The example schedulers in the posting are all we
(Meta) have been experimenting with, except for some really hacky
soft-affinity trials which will be generalized and shared too. David has
also been actively working to apply the shared runqueue changes, which came
out of earlier sched_ext experiments, to CFS. Google has been open-sourcing
their ghOSt framework and the schedulers built on top of it, which will be
ported to sched_ext in the future. Google is starting to see promising
results with search and will share their findings in code and through other
venues, including conferences.

> We all lose in that scenario. Not least me, because I get the
> additional maintenance burden.

sched_ext isn't that invasive to the core code, and its interactions with
the other scheduling classes are very limited. It would make changing the
core scheduling APIs a bit more burdensome, but those APIs have been
relatively stable, and both David and I would be on the hook if anything
gets in your way. I don't see why this would significantly increase your
maintenance burden. It's a thing, but it's a thing in its own corner.

> I also don't see upsides to merging this. You all can play with
> schedulers out-of-tree just fine and then submit what actually works.

There is a huge difference between having a common framework upstream and
not having one. If it's in the kernel, everyone knows that it's widely
available and will remain so for a very long time. That removes the risk of
investing energy and effort into something which may or may not exist next
year.

It also has a standardizing effect, where different parties can exchange
code and ideas easily. It's so much more effective to build directly upon
other people's work than to reimplement everything on your own or navigate
a maze of different frameworks and patches against different baseline
kernel versions and so on. I mean, these are the reasons we want things
upstreamed, right?

Thanks.

--
tejun