Re: [RFC PATCH 3/3] sched/fair: Add a per-shard overload flag

From: David Vernet
Date: Tue Oct 03 2023 - 17:05:21 EST


On Wed, Sep 27, 2023 at 02:59:29PM +0800, Chen Yu wrote:
> Hi Prateek,

Hi Chenyu,

> On 2023-09-27 at 09:53:13 +0530, K Prateek Nayak wrote:
> > Hello David,
> >
> > Some more test results (although this might be slightly irrelevant with
> > next version around the corner)
> >
> > On 9/1/2023 12:41 AM, David Vernet wrote:
> > > On Thu, Aug 31, 2023 at 04:15:08PM +0530, K Prateek Nayak wrote:
> > >
> > -> With EEVDF
> >
> > o tl;dr
> >
> > - Same as what was observed without EEVDF but shared_runq shows
> > serious regression with multiple more variants of tbench and
> > netperf now.
> >
> > o Kernels
> >
> > eevdf : tip:sched/core at commit b41bbb33cf75 ("Merge branch 'sched/eevdf' into sched/core")
> > shared_runq : eevdf + correct time accounting with v3 of the series without any other changes
> > shared_runq_idle_check : shared_runq + move the rq->avg_idle check before peeking into the shared_runq
> > (the rd->overload check still remains below the shared_runq access)
> >
>
> I did not see any obvious regression on a Sapphire Rapids server and it seems that
> the result on your platform suggests that C/S workload could be impacted
> by shared_runq. Meanwhile some individual workloads like HHVM in David's environment
> (no shared resource between tasks if I understand correctly) could benefit from

Correct, hhvmworkers are largely independent, though they do sometimes
synchronize, and they also sometimes rely on I/O happening in other
tasks.

> shared_runq a lot. This makes me wonder if we can let shared_runq skip the C/S tasks.

I'm also open to this possibility, but I worry that we'd be going down
the same rabbit hole as what fair.c does already, which is use
heuristics to determine when something should or shouldn't be migrated,
etc. I really do feel that there's value in SHARED_RUNQ providing
consistent and predictable work conservation behavior.

On the other hand, it's clear that there are things we can do to improve
performance for some of these client/server workloads that hammer the
runqueue on larger CCXs / sockets. If we can avoid those regressions
while still having reasonably high confidence that work conservation
won't disproportionately suffer, I'm open to us making some tradeoffs
and/or adding a bit of complexity to avoid some of this unnecessary
contention.

I think it's probably about time for v4 to be sent out. What do you
folks think about including:

1. A few various fixes / tweaks from v3, e.g. avoiding using the wrong
shard on the task_dead_fair() path if the feature is disabled before
a dying task is dequeued from a shard, fixing the build issues
pointed out by lkp, etc.
2. Fix the issue that Prateek pointed out in [0] where we're not
properly skipping the LLC domain due to using the for_each_domain()
macro (this is also addressed by (3)).
3. Apply Prateek's suggestions (in some form) in [1] and [2]. For [2],
I'm inclined to just avoid enqueuing a task on a shard if the rq it's on
has nr_running == 0. Or, we can just add his patch to the series
directly if it turns out that just looking at rq->nr_running is
insufficient.

[0]: https://lore.kernel.org/all/3e32bec6-5e59-c66a-7676-7d15df2c961c@xxxxxxx/
[1]: https://lore.kernel.org/all/20230831104508.7619-3-kprateek.nayak@xxxxxxx/
[2]: https://lore.kernel.org/all/20230831104508.7619-4-kprateek.nayak@xxxxxxx/

Prateek -- what do you think about this? I want to make sure you get
credit for your contributions to this series, so let me know how you'd
like to apply these changes. [1] essentially just improves much of the
logic from [3], so I'm not sure it would make sense to include it as a
separate patch. I'm happy to include a Co-authored-by tag, or to just
explicitly credit your contributions in the commit summary if you'd
prefer that.

[3]: https://lore.kernel.org/all/20230809221218.163894-7-void@xxxxxxxxxxxxx/

Thanks,
David