Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS

From: Gautham R. Shenoy
Date: Fri Aug 18 2023 - 04:49:57 EST


Hello David,

On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> > Hello David,
>
> Hello Gautham,
>
> Thanks a lot as always for running some benchmarks and analyzing these
> changes.
>
> > On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > > Changes
> > > -------
> > >
> > > This is v3 of the shared runqueue patchset. This patch set is based off
> > > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > > bandwidth in use") on the sched/core branch of tip.git.
> >
> >
> > I tested the patches on Zen3 and Zen4 EPYC servers like last time. I
> > noticed that apart from hackbench, every other benchmark is showing
> > regressions with this patch series. Quick summary of my observations:
>
> Just to verify per our prior conversation [0], was this latest set of
> benchmarks run with boost disabled?

Boost is enabled by default. I will queue a run tonight with boost
disabled.
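
For reference, boost can be toggled via the global cpufreq sysfs knob
(when the cpufreq driver exposes it), e.g. to disable it for the run:

  # echo 0 > /sys/devices/system/cpu/cpufreq/boost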

> Your analysis below seems to
> indicate pretty clearly that v3 regresses some workloads regardless of
> boost, but I did want to double check so we can bear it in mind when
> looking over the results.
>
> [0]: https://lore.kernel.org/all/ZMn4dLePB45M5CGa@xxxxxxxxxxxxxxxxxxxxxx/
>
> > * With shared-runqueue enabled, tbench and netperf both stop scaling
> > when we go beyond 32 clients and the scaling issue persists until
> > the system is overutilized. When the system is overutilized,
> > shared-runqueue is able to recover quite splendidly and outperform
> > tip.
>
> Hmm, I still don't understand why we perform better when the system is
> overutilized. I'd expect vanilla CFS to perform better than shared_runq
> in such a scenario in general, as the system will be fully utilized
> regardless.

My hunch is that when every rq is equally loaded, we don't end up
spending as much time doing newidle load balance at the higher
domains, because a newly idle CPU will likely find a task to pull
from the shared runqueue.

IOW, the shared runqueue feature is generally useful, but we need to
figure out why we are doing excessive load-balancing when the system
is only moderately utilized.


[..snip..]

> >
> > ==================================================================
> > Test : tbench
> > Units : Normalized throughput
> > Interpretation: Higher is better
> > Statistic : AMean
> > ==================================================================
> > Clients: tip[pct imp](CV) sh_rq_v3[pct imp](CV) sh_rq_v3_tgload_fix[pct imp](CV)
> > 32 1.00 [ 0.00]( 2.90) 0.44 [-55.53]( 1.44) 0.98 [ -2.23]( 1.72)
> > 64 1.00 [ 0.00]( 1.02) 0.27 [-72.58]( 0.35) 0.74 [-25.64]( 2.43)
> > 128 1.00 [ 0.00]( 0.88) 0.19 [-81.29]( 0.51) 0.52 [-48.47]( 3.92)
> > 256 1.00 [ 0.00]( 0.28) 0.17 [-82.80]( 0.29) 0.88 [-12.23]( 1.76)
>
> Just to make sure we're on the same page, "CV" here is the coefficient
> of variation (i.e. standard deviation / mean), correct?

Yes, CV is the coefficient of variation: the ratio of the standard
deviation to the mean.


[..snip..]

> > So, I counted the number of times each CPU calls find_busiest_group()
> > without and with shared_rq, and the difference between the two
> > distributions is quite stark.
> >
> > =======================================================
> >  per-cpu               :         Number of CPUs        :
> >  find_busiest_group()  :---------------:---------------:
> >  count                 : without-sh.rq :  with-sh.rq   :
> > =======================================================
> > [       0 -  200000 )  :       77      :               :
> > [  200000 -  400000 )  :       41      :               :
> > [  400000 -  600000 )  :       64      :               :
> > [  600000 -  800000 )  :       63      :               :
> > [  800000 - 1000000 )  :       66      :               :
> > [ 1000000 - 1200000 )  :       69      :               :
> > [ 1200000 - 1400000 )  :       52      :               :
> > [ 1400000 - 1600000 )  :       34      :        5      :
> > [ 1600000 - 1800000 )  :       17      :       31      :
> > [ 1800000 - 2000000 )  :        6      :       59      :
> > [ 2000000 - 2200000 )  :       13      :      109      :
> > [ 2200000 - 2400000 )  :        4      :      120      :
> > [ 2400000 - 2600000 )  :        3      :      157      :
> > [ 2600000 - 2800000 )  :        1      :       29      :
> > [ 2800000 - 3000000 )  :        1      :        2      :
> > [ 9200000 - 9400000 )  :        1      :               :
> >
> > As you can see, without shared-rq most CPUs fall into the lower
> > buckets of the find_busiest_group() call-count distribution, which
> > implies fewer calls in total. With shared-rq enabled, the distribution
> > is roughly normal but shifted to the right, which implies a lot more
> > calls to find_busiest_group().
>
> Huh, that's very much unexpected for obvious reasons -- we should be
> hitting the load_balance() path _less_ due to scheduling tasks with the
> shared_runq.

I would like to verify what the shared_rq hit-ratio is when the system
is moderately loaded and running short-lived tasks such as
tbench/netperf. My hunch is that in the moderately loaded case, the
newly idle CPUs are not finding any task in the shared runqueue.
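
One way to get a rough hit/miss count during a tbench run would be
something like the following (untested sketch; it assumes the v3
helper is called shared_runq_pick_next_task(), is not inlined, and
returns a positive value when a task was pulled):

  # count newidle pulls that found a task in the shared runq vs. those
  # that came back empty, over a 60 second window
  bpftrace -e '
      kretprobe:shared_runq_pick_next_task {
              /* assumption: retval > 0 means a task was pulled */
              @result[retval > 0 ? "hit" : "miss"] = count();
      }
      interval:s:60 { exit(); }'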


>
> > To investigate further where this is coming from, I reran tbench with
> > sched-scoreboard (https://github.com/AMDESE/sched-scoreboard), and the
> > schedstats show that the total wait-time of the tasks on the runqueue
> > *increases* by a significant amount when shared-rq is enabled.
> >
> > Further, if you look at the newidle load_balance() attempts at the DIE
> > and the NUMA domains, they are significantly higher when shared-rq is
> > enabled. So it appears that a lot more time is being spent trying to
> > do load-balancing when the shared runqueue is enabled, which is
> > counter-intuitive.
>
> Certainly agreed on it being counter-intuitive. I wonder if this is due
> to this change in the latest v3 revision [1]:
>
> @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> + /*
> + * Skip <= LLC domains as they likely won't have any tasks if the
> + * shared runq is empty.
> + */
> + if (sched_feat(SHARED_RUNQ)) {
> + sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> + if (likely(sd))
> + sd = sd->parent;
> + }
> +
> if (!READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> [1]: https://lore.kernel.org/all/20230809221218.163894-7-void@xxxxxxxxxxxxx/
>
> I originally added this following Peter's suggestion in v2 [2] with the
> idea that we'd skip the <= LLC domains when shared_runq is enabled, but
> in hindsight, we also aren't walking the shared_runq list until we find
> a task that can run on the current core. If there's a task, and it can't
> run on the current core, we give up and proceed to the rest of
> newidle_balance(). So it's possible that we're incorrectly assuming
> there's no task in the current LLC because it wasn't enqueued at the
> head of the shared_runq. I think we'd only want to add this improvement
> if we walked the list as you tried in v1 of the patch set, or revert it
> otherwise.

Yes, the optimization does make sense if we are sure that there are no
tasks to be pulled within the SMT and the MC domains. Since I am not
pinning any tasks, if a newly idle CPU is doing load-balance it is
very likely because the shared-rq is empty, which implies that the
SMT and MC domains are not overloaded.

But that also means exploring load-balance at a fairly large DIE
domain much earlier. And once we go past the should_we_balance()
check, we bail out only after doing find_busiest_group() and
find_busiest_queue(), which can take quite a bit of time on a DIE
domain that spans 256 threads. By this time, if a task was woken up
on the CPU, it would have to wait for the load-balance to complete.

In any case, this can be easily verified by reverting the
optimization.
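
i.e., just dropping the hunk quoted above from newidle_balance()
(modulo the exact context in the v3 series):

@@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
-	/*
-	 * Skip <= LLC domains as they likely won't have any tasks if the
-	 * shared runq is empty.
-	 */
-	if (sched_feat(SHARED_RUNQ)) {
-		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
-		if (likely(sd))
-			sd = sd->parent;
-	}
-
 	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {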

>
> [2]: https://lore.kernel.org/lkml/20230711094547.GE3062772@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> Would you be able to try running your benchmarks again with that change
> removed, or with your original idea of walking the list added?

I will queue this for later today.

> I've
> tried to reproduce this issue on my 7950x, as well as a single-socket /
> CCX 26 core / 52 thread Cooper Lake host, but have been unable to. For
> example, if I run funccount.py 'load_balance' -d 30 (from [3]) while
> running the below netperf command on the Cooper Lake, this is what I
> see:
>
> for i in `seq 128`; do netperf -6 -t UDP_RR -c -C -l 60 & done
>
> NO_SHARED_RUNQ
> --------------
> FUNC COUNT
> b'load_balance' 39636
>
>
> SHARED_RUNQ
> -----------
> FUNC COUNT
> b'load_balance' 32345
>
> [3]: https://github.com/iovisor/bcc/blob/master/tools/funccount.py
>
> The stack traces also don't show us running load_balance() excessively.
> Please feel free to share how you're running tbench as well and I can
> try that on my end.

This is how I am running tbench:

# wget https://www.samba.org/ftp/tridge/dbench/dbench-4.0.tar.gz
# tar xvf dbench-4.0.tar.gz
# cd dbench-4.0
# ./autogen.sh
# ./configure
# make
# nohup ./tbench_srv 0 &
# ./tbench -t 60 <nr-clients> -c ./client.txt
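
To cover the client counts in the tbench table above, the final step
is just repeated for each value, roughly:

  # one 60 second run per client count from the table above
  for clients in 32 64 128 256; do
          ./tbench -t 60 $clients -c ./client.txt
  done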


[..snip..]

>
> > ==================================================================
> > Test : hackbench
> > Units : Normalized time in seconds
> > Interpretation: Lower is better
> > Statistic : AMean
> > ==================================================================
> > Case: tip[pct imp](CV) sh_rq_v3[pct imp](CV) sh_rq_v3_tgload_fix[pct imp](CV)
> > 1-groups 1.00 [ 0.00]( 8.41) 0.96 [ 3.63]( 6.04) 0.94 [ 6.48]( 9.16)
> > 2-groups 1.00 [ 0.00](12.96) 0.96 [ 4.46]( 9.76) 0.89 [ 11.02]( 8.28)
> > 4-groups 1.00 [ 0.00]( 2.90) 0.85 [ 14.77]( 9.18) 0.86 [ 14.35](13.26)
> > 8-groups 1.00 [ 0.00]( 1.06) 0.91 [ 8.96]( 2.83) 0.94 [ 6.34]( 2.02)
> > 16-groups 1.00 [ 0.00]( 0.57) 1.19 [-18.91]( 2.82) 0.74 [ 26.02]( 1.33)
>
> Nice, this matches what I observed when running my benchmarks as well.

Yes, hackbench seems to benefit from the shared-runqueue patches, more
so with Aaron's ratelimiting patch.

[..snip..]


> > Test : netperf
>
> Could you please share exactly how you're invoking netperf? Is this
> with -t UDP_RR (which is what Aaron was running with), or the default?

No, it is TCP_RR. This is how I am invoking netperf:

Having started the netserver listening on 127.0.0.1, I run N copies of
the following command, where N is the number of clients:

netperf -H 127.0.0.1 -t TCP_RR -l 100 -- -r 100 -k REQUEST_SIZE,RESPONSE_SIZE,ELAPSED_TIME,THROUGHPUT,THROUGHPUT_UNITS,MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MAX_LATENCY,STDDEV_LATENCY
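
i.e., something along the lines of:

  # launch N concurrent TCP_RR clients against the local netserver and
  # wait for all of them to finish
  for i in $(seq $N); do
          netperf -H 127.0.0.1 -t TCP_RR -l 100 -- -r 100 \
              -k REQUEST_SIZE,RESPONSE_SIZE,ELAPSED_TIME,THROUGHPUT,THROUGHPUT_UNITS,MIN_LATENCY,MEAN_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MAX_LATENCY,STDDEV_LATENCY &
  done
  wait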

>
> [...]
>
> As far as I can tell there weren't many changes between v2 and v3 that
> could have caused this regression. I strongly suspect the heuristic I
> mentioned above, especially given your analysis showing that we're
> calling load_balance() excessively.

Sure. I will get back once I have those results.

>
> Thanks,
> David

--
Thanks and Regards
gautham.