Re: [PATCH v3 7/7] sched: Shard per-LLC shared runqueues

From: David Vernet
Date: Thu Aug 31 2023 - 15:14:52 EST


On Thu, Aug 31, 2023 at 06:45:11PM +0800, Chen Yu wrote:
> On 2023-08-30 at 19:01:47 -0500, David Vernet wrote:
> > On Wed, Aug 30, 2023 at 02:17:09PM +0800, Chen Yu wrote:
> > >
> > > 5. Check the L2 cache miss rate.
> > > perf stat -e l2_rqsts.references,l2_request.miss sleep 10
> > > The results show that the L2 cache miss rate is nearly the same with/without
> > > shared_runqueue enabled.
> >
> > As mentioned below, I expect it would be interesting to also collect
> > icache / iTLB numbers. In my experience, poor uop cache locality will
> > also result in poor icache locality, though of course that depends on a
> > lot of other factors like alignment, how many (un)conditional branches
> > you have within some byte window, etc. If alignment, etc were the issue
> > though, we'd likely observe this also without SHARED_RUNQ.
> >
>
> [snip...]
>
> >
> > Interesting. As mentioned above, I expect we'd also see an increase in
> > iTLB and icache misses?
> >
>
> This is a good point. According to perf topdown:
>
> SHARED_RUNQ is disabled:
>
> 13.0 % tma_frontend_bound
> 6.7 % tma_fetch_latency
> 0.3 % tma_itlb_misses
> 0.7 % tma_icache_misses
>
> itlb miss ratio is 13.0% * 6.7% * 0.3%
> icache miss ratio is 13.0% * 6.7% * 0.7%
>
> SHARED_RUNQ is enabled:
> 20.0 % tma_frontend_bound
> 11.6 % tma_fetch_latency
> 0.9 % tma_itlb_misses
> 0.5 % tma_icache_misses
>
> itlb miss ratio is 20.0% * 11.6% * 0.9%
> icache miss ratio is 20.0% * 11.6% * 0.5%
>
> So both the icache and iTLB miss ratios increase, and the iTLB misses increase more,
> although the bottleneck is neither the icache nor the iTLB.
> And as you mentioned below, it depends on other factors, including the hardware
> settings, icache size, TLB size, DSB size, etc.

Thanks for collecting these stats. Good to know that things are making
sense and the data we're collecting are telling a consistent story.

> > This is something we deal with in HHVM. Like any other JIT engine /
> > compiler, it is heavily front-end CPU bound, and has very poor icache,
> > iTLB, and uop cache locality (also lots of branch resteers, etc).
> > SHARED_RUNQ actually helps this workload quite a lot, as explained in
> > the cover letter for the patch series. It makes sense that it would: uop
> > locality is really bad even without increasing CPU util. So we have no
> > reason not to migrate the task and hop on a CPU.
> >
>
> I see, this makes sense.
>
> > > I wonder if SHARED_RUNQ could consider that: if a task is a long-duration one,
> > > say, p->avg_runtime >= sysctl_migration_cost, maybe we should not put such a task
> > > on the per-LLC shared runqueue? In this way it will not be migrated too often,
> > > so as to keep its locality (both in terms of L1/L2 cache and DSB).
> >
> > I'm hesitant to apply such heuristics to the feature. As mentioned
> > above, SHARED_RUNQ works very well on HHVM, despite its potential hit to
> > icache / iTLB / DSB locality. Those hhvmworker tasks run for a very long
> > time, sometimes upwards of 20ms. They also tend to have poor L1 cache
> > locality in general even when they're scheduled on the same core they
> > were on before they were descheduled, so we observe better performance
> > if the task is migrated to a fully idle core rather than e.g. its
> > hypertwin if it's available. That's not something we can guarantee with
> > SHARED_RUNQ, but it hopefully illustrates that this is an example of a
> > workload that would suffer from such a heuristic.
> >
>
> OK, hackbench is just a microbenchmark to help us evaluate
> what impact SHARED_RUNQ could bring. If such a heuristic helps
> hackbench but hurts real workloads, then we can consider
> other directions.
>
> > Another point to consider is that performance implications that are a
> > result of Intel micro architectural details don't necessarily apply to
> > everyone. I'm not as familiar with the instruction decode pipeline on
> > AMD chips like Zen4. I'm sure they have a uop cache, but the size of
> > that cache, alignment requirements, the way that cache interfaces with
> > e.g. their version of the MITE / decoder, etc, are all going to be quite
> > different.
> >
>
> Yes, this is true.
>
> > In general, I think it's difficult for heuristics like this to suit all
> > possible workloads or situations (not that you're claiming this one does). My
> > preference is to keep it as is so that it's easier for users to build a
> > mental model of what outcome they should expect if they use the feature.
> > Put another way: As a user of this feature, I'd be a lot more surprised
> > to see that I enabled it and CPU util stayed low, vs. enabling it and
> > seeing higher CPU util, but also degraded icache / iTLB locality.
> >
>
> Understand.
>
> > Let me know what you think, and thanks again for investing your time
> > into this.
> >
>
> Let me run some other benchmarks to see if they are sensitive to
> resource locality.

Great, thank you Chenyu.
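
Just so we're on the same page, here is roughly the shape of the check I
understood you to be proposing (a sketch only; avg_runtime and the
shared_runq_enqueue() helper below are placeholder names rather than
identifiers from this series, and sysctl_sched_migration_cost is the
in-tree spelling of the knob you mentioned):

/*
 * Sketch: skip the per-LLC shared runqueue for long-duration tasks so
 * they are migrated less often and keep more of their L1/L2/DSB state
 * warm.
 */
static void shared_runq_enqueue(struct task_struct *p)
{
	/* Hypothetical per-task average runtime, in ns. */
	if (p->avg_runtime >= sysctl_sched_migration_cost)
		return;

	/* ...existing shared runqueue enqueue path... */
}

As discussed above, though, my concern is that exactly this kind of filter
would skip the long-running tasks (e.g. hhvmworkers) that benefit most from
the feature, which is why I'd prefer to keep the behavior uniform for now.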

FYI, I'll be on vacation for over a week starting later today, so my
responses may be delayed.

Thanks in advance for working on this. Looking forward to seeing the
results when I'm back at work.

Thanks,
David