Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues

From: David Vernet
Date: Thu Aug 31 2023 - 15:14:52 EST


On Thu, Aug 31, 2023 at 06:45:11PM +0800, Chen Yu wrote:
> On 2023-08-30 at 19:01:47 -0500, David Vernet wrote:
> > On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
> > >
> > > 5. Check the L2 cache miss rate.
> > > perf stat -e l2_rqsts.references,l2_request.miss sleep 10
> > > The results show that the L2 cache miss rate is nearly the same with/without
> > > shared_runqueue enabled.
> >
> > As mentioned below, I expect it would be interesting to also collect
> > icache / iTLB numbers. In my experience, poor uop cache locality will
> > also result in poor icache locality, though of course that depends on a
> > lot of other factors like alignment, how many (un)conditional branches
> > you have within some byte window, etc. If alignment, etc were the issue
> > though, we'd likely observe this also without SHARED_RUNQ.
> >
>
> [snip...]
>
> >
> > Interesting. As mentioned above, I expect we'd also see an increase in
> > iTLB and icache misses?
> >
>
> This is a good point. According to perf topdown:
>
> SHARED_RUNQ is disabled:
>
> 13.0 % tma_frontend_bound
> 6.7 % tma_fetch_latency
> 0.3 % tma_itlb_misses
> 0.7 % tma_icache_misses
>
> itlb miss ratio is 13.0% * 6.7% * 0.3%
> icache miss ratio is 13.0% * 6.7% * 0.7%
>
> SHARED_RUNQ is enabled:
> 20.0 % tma_frontend_bound
> 11.6 % tma_fetch_latency
> 0.9 % tma_itlb_misses
> 0.5 % tma_icache_misses
>
> itlb miss ratio is 20.0% * 11.6% * 0.9%
> icache miss ratio is 20.0% * 11.6% * 0.5%
>
> So both the icache and iTLB miss ratios increase, and the iTLB misses increase more,
> although the bottleneck is neither the icache nor the iTLB.
> And as you mentioned below, it depends on other factors, including the hardware
> settings, icache size, TLB size, DSB size, etc.

Thanks for collecting these stats. Good to know that things are making
sense and the data we're collecting are telling a consistent story.

> > This is something we deal with in HHVM. Like any other JIT engine /
> > compiler, it is heavily front-end CPU bound, and has very poor icache,
> > iTLB, and uop cache locality (also lots of branch resteers, etc).
> > SHARED_RUNQ actually helps this workload quite a lot, as explained in
> > the cover letter for the patch series. It makes sense that it would: uop
> > locality is really bad even without increasing CPU util. So we have no
> > reason not to migrate the task and hop on a CPU.
> >
>
> I see, this makes sense.
>
> > > I wonder if SHARED_RUNQ could consider that: if a task is a long-duration one,
> > > say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such a task
> > > on the per-LLC shared runqueue? In this way it will not be migrated too often,
> > > so as to keep its locality (both in terms of L1/L2 cache and DSB).
> >
> > I'm hesitant to apply such heuristics to the feature. As mentioned
> > above, SHARED_RUNQ works very well on HHVM, despite its potential hit to
> > icache / iTLB / DSB locality. Those hhvmworker tasks run for a very long
> > time, sometimes upwards of 20ms. They also tend to have poor L1 cache
> > locality in general even when they're scheduled on the same core they
> > were on before they were descheduled, so we observe better performance
> > if the task is migrated to a fully idle core rather than e.g. its
> > hypertwin if it's available. That's not something we can guarantee with
> > SHARED_RUNQ, but it hopefully illustrates that this is an example of a
> > workload that would suffer from such a heuristic.
> >
>
> OK, hackbench is just a microbenchmark to help us evaluate
> what impact SHARED_RUNQ could bring. If such a heuristic helps
> hackbench but hurts real workloads, then we can consider
> other directions.
>
> > Another point to consider is that performance implications that are a
> > result of Intel micro architectural details don't necessarily apply to
> > everyone. I'm not as familiar with the instruction decode pipeline on
> > AMD chips like Zen4. I'm sure they have a uop cache, but the size of
> > that cache, alignment requirements, the way that cache interfaces with
> > e.g. their version of the MITE / decoder, etc, are all going to be quite
> > different.
> >
>
> Yes, this is true.
>
> > In general, I think it's difficult for heuristics like this to suit all
> > possible workloads or situations (not that you're claiming this one does). My
> > preference is to keep it as is so that it's easier for users to build a
> > mental model of what outcome they should expect if they use the feature.
> > Put another way: As a user of this feature, I'd be a lot more surprised
> > to see that I enabled it and CPU util stayed low, vs. enabling it and
> > seeing higher CPU util, but also degraded icache / iTLB locality.
> >
>
> Understand.
>
> > Let me know what you think, and thanks again for investing your time
> > into this.
> >
>
> Let me run some other benchmarks to see if they are sensitive to
> resource locality.

Great, thank you Chenyu.
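
Just so we're on the same page, here is roughly the shape of the check I
understood you to be proposing (a sketch only; avg_runtime and the
shared_runq_enqueue() helper below are placeholder names rather than
identifiers from this series, and sysctl_sched_migration_cost is the
in-tree spelling of the knob you mentioned):

/*
 * Sketch: skip the per-LLC shared runqueue for long-duration tasks so
 * they are migrated less often and keep more of their L1/L2/DSB state
 * warm.
 */
static void shared_runq_enqueue(struct task_struct *p)
{
	/* Hypothetical per-task average runtime, in ns. */
	if (p->avg_runtime >= sysctl_sched_migration_cost)
		return;

	/* ...existing shared runqueue enqueue path... */
}

As discussed above, though, my concern is that exactly this kind of filter
would skip the long-running tasks (e.g. hhvmworkers) that benefit most from
the feature, which is why I'd prefer to keep the behavior uniform for now.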

FYI, I'll be on vacation for over a week starting later today, so my
responses may be delayed.

Thanks in advance for working on this. Looking forward to seeing the
results when I'm back at work.

Thanks,
David