Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Jirka Hladky
Date: Thu Mar 12 2020 - 08:17:56 EST


Hi Mel,

thanks a lot for analyzing it!

My big concern is that the performance drop for low thread counts
(roughly up to 2x the number of NUMA nodes, i.e. up to ~16 threads on
the 8-node machines below) is not just a rare corner case but may be
fairly common. We see the drop for the following benchmarks/tests,
especially on 8-NUMA-node servers, although four- and even two-node
servers are affected as well.

The numbers show the performance drop (median runtime over 5
consecutive runs) compared to the vanilla kernel.

2x AMD 7351 (EPYC Naples), 8 NUMA nodes
===================================
NAS: sp_C test: -50%, peak perf. drop with 8 threads
NAS: mg_D: -10% with 16 threads
SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
SPECjvm2008: compress and cr_signverify tests: -10% (peak drop with 8 threads)
SPECjbb2005: -10% for 16 threads

4x INTEL Xeon GOLD-6126 with Sub-NUMA clustering enabled, 8 NUMA nodes
=============================================================
NAS: sp_C test: -35%, peak perf. drop with 16 threads
SPECjvm2008: co_sunflow, compress and cr_signverify tests: -10% (peak drop with 8 threads)
SPECjbb2005: -10% for 24 threads

So far, I have run only a limited number of our tests. I can run our
full testing suite next week if required. Please let me know.

Thanks!
Jirka


On Thu, Mar 12, 2020 at 10:54 AM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Mar 09, 2020 at 08:36:25PM +0000, Mel Gorman wrote:
> > > The actual data reports are on an intranet web page so they are harder to
> > > share. I can create PDFs or screenshots but I didn't want to just blast
> > > those to the list. I'd be happy to send some directly if you are interested.
> > >
> >
> > Send them to me privately please.
> >
> > > Some data in text format I can easily include shows imbalances across the
> > > numa nodes. This is for NAS sp.C.x benchmark because it was easiest to
> > > pull and see the data in text. The regressions can be seen in other tests
> > > as well.
> > >
> >
> > What was the value for x?
> >
> > I ask because I ran NAS across a variety of machines for C class in two
> > configurations -- one using as many CPUs as possible and one running
> > with a third of the available CPUs for both MPI and OMP. Generally there
> > were small gains and losses across multiple kernels but often within the
> > noise or within a few percent of each other.
> >
>
> On re-examining the case, this pattern matches. There are some obvious
> corner cases for large machines that have low utilisation. With the
> old behaviour, load balancing would spread the load evenly across all
> available NUMA nodes while NUMA balancing would constantly adjust it
> for locality. The old load balancer does this even if a task starts
> with all of its memory local to one node.
>
> The range where it causes the most problems appears to be roughly task
> counts lower than 2 * NR_NODES, matching the small imbalance allowed by
> adjust_numa_imbalance, but the actual distribution is variable. It's
> not always 2 per node; sometimes it can be a little higher, depending
> on when idle balancing happens and on other machine activity. This is
> not universal, as other machine sizes and workloads are fine with the
> new behaviour and generally benefit.
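>
> To illustrate, here is a stripped-down sketch of the check involved.
> The function name and calling context here are simplified for the
> example; the real adjust_numa_imbalance() in kernel/sched/fair.c has
> more context around it:
>
>       /*
>        * Sketch only: tolerate a small imbalance (about two tasks) per
>        * NUMA node so that a pair of communicating tasks can stay local.
>        * Across the machine this is why roughly 2 * NR_NODES tasks can
>        * remain packed before the load balancer spreads them out.
>        */
>       static long sketch_adjust_numa_imbalance(long imbalance, int nr_running)
>       {
>               if (nr_running <= 2)
>                       return 0;       /* nearly idle: treat as balanced */
>
>               return imbalance;       /* otherwise balance as usual */
>       }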
>
> The problem is particularly visible when the only active tasks in the
> system have set numa_preferred_nid, because as far as the load balancer
> and NUMA balancer are concerned, there is no reason to force the SP
> workload to spread wide.
>
> > The largest machine I had available was 4 sockets.
> >
> > The other curiosity is that you used C class. On bigger machines, that
> > is very short lived to the point of being almost useless. Is D class
> > similarly affected?
> >
>
> I expect D class to be similarly affected because the same pattern holds
> -- tasks stay on CPUs local to their memory even though more memory
> bandwidth may be available on remote nodes.
>
> > > 5.6.0_rc3.tip_lb_numa+
> > > sp.C.x_008_02 - CPU load average across the individual NUMA nodes
> > > (timestep is 5 seconds)
> > > # NUMA | AVR | Utilization over time in percentage
> > > 0 | 5 | 12 9 3 0 0 11 8 0 1 3 5 17 9 5 0 0 0 11 3
> > > 1 | 16 | 20 21 10 10 2 6 9 12 11 9 9 23 24 23 24 24 24 19 20
> > > 2 | 21 | 19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
> > > 3 | 15 | 19 23 20 21 21 15 15 20 20 18 10 10 9 9 9 9 9 9 11
> > > 4 | 19 | 13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
> > > 5 | 3 | 0 2 11 6 20 8 0 0 0 0 0 0 0 0 0 0 0 0 9
> > > 6 | 0 | 0 0 0 5 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
> > > 7 | 4 | 0 0 0 0 0 0 4 11 9 0 0 0 0 5 12 12 12 3 0
> > >
> > > 5.6.0-0.rc3.1.elrdy
> > > sp.C.x_008_01 - CPU load average across the individual NUMA nodes
> > > (timestep is 5 seconds)
> > > # NUMA | AVR | Utilization over time in percentage
> > > 0 | 13 | 6 8 10 10 11 10 18 13 20 17 14 15
> > > 1 | 11 | 10 10 11 11 9 16 12 14 9 11 11 10
> > > 2 | 17 | 25 19 16 11 13 12 11 16 17 22 22 16
> > > 3 | 21 | 21 22 22 23 21 23 23 21 21 17 22 21
> > > 4 | 14 | 20 23 11 12 15 18 12 10 9 13 12 18
> > > 5 | 4 | 0 0 8 10 7 0 8 2 0 0 8 2
> > > 6 | 1 | 0 5 1 0 0 0 0 0 0 1 0 0
> > > 7 | 7 | 7 3 10 10 10 11 3 8 10 4 0 5
> > >
> >
> > A critical difference with the series is that large imbalances shouldn't
> > happen, but prior to the series NUMA balancing would keep trying to
> > move tasks to a node while load balancing moved them back. That should
> > not happen any more, but there are cases where it's actually faster to
> > have the fight between NUMA balancing and load balancing. Ideally a
> > degree of imbalance would be allowed, but I haven't found a way of doing
> > that without side effects.
> >
>
> So this is what's happening -- at low utilisation, tasks are staying
> local to their memory. In a lot of cases this is a good thing --
> communicating tasks stay local, for example, and tasks that are not
> completely memory bound benefit. Machines that have sufficient local
> memory bandwidth also appear to benefit.
>
> sp.C appears to be a significant corner case when the degree of
> parallelisation is lower than the number of NUMA nodes in the system;
> of the NAS workloads, bt is also mildly affected. In each case, memory
> was almost completely local and there was little NUMA activity, but
> performance suffered. This is the BT case:
>
>                           5.6.0-rc3             5.6.0-rc3
>                             vanilla    schedcore-20200227
> Min       bt.C    176.05 (   0.00%)    185.03 (  -5.10%)
> Amean     bt.C    178.62 (   0.00%)    185.54 *  -3.88%*
> Stddev    bt.C      4.26 (   0.00%)      0.60 (  85.95%)
> CoeffVar  bt.C      2.38 (   0.00%)      0.32 (  86.47%)
> Max       bt.C    186.09 (   0.00%)    186.48 (  -0.21%)
> BAmean-50 bt.C    176.18 (   0.00%)    185.08 (  -5.06%)
> BAmean-95 bt.C    176.75 (   0.00%)    185.31 (  -4.84%)
> BAmean-99 bt.C    176.75 (   0.00%)    185.31 (  -4.84%)
>
> Note the spread in performance. tip/sched/core looks worse on average,
> but its coefficient of variation was just 0.32% versus 2.38% with the
> vanilla kernel. The vanilla kernel is a lot less stable in terms of
> performance due to the fighting between CPU load balancing and NUMA
> balancing.
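>
> (For reference, CoeffVar is just Stddev / Amean from the table above:
> 4.26 / 178.62 ~= 2.38% for the vanilla kernel versus 0.60 / 185.54 ~=
> 0.32% for tip/sched/core.)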
>
> A heatmap of the CPU usage per LLC showed 4 tasks running on 2 nodes
> with two nodes idle -- there was almost no other system activity that
> would let the load balancer act on tasks that are unconcerned with
> locality. The vanilla case was interesting -- of the 5 iterations, 4
> spread out with one task on each of 4 nodes, but one iteration stacked
> 4 tasks on 2 nodes, so it's not even consistent. The NUMA activity
> looked like this for the overall workload.
>
> Ops NUMA alloc hit                  3450166.00      2406738.00
> Ops NUMA alloc miss                       0.00            0.00
> Ops NUMA interleave hit                   0.00            0.00
> Ops NUMA alloc local                1047975.00        41131.00
> Ops NUMA base-page range updates   15864254.00     16283456.00
> Ops NUMA PTE updates               15148478.00     15563584.00
> Ops NUMA PMD updates                   1398.00         1406.00
> Ops NUMA hint faults               15128332.00     15535357.00
> Ops NUMA hint local faults %       12253847.00     14471269.00
> Ops NUMA hint local percent              81.00           93.15
> Ops NUMA pages migrated              993033.00            4.00
> Ops AutoNUMA cost                     75771.58        77790.77
>
> PTE hinting was more or less the same, but look at the locality: 81%
> local for the baseline vanilla kernel and 93.15% for what's in
> tip/sched/core. The baseline kernel migrates almost 1 million pages over
> 15 minutes (5 iterations) while tip/sched/core migrates ... 4 pages.
>
> Looking at the faults over time, the baseline kernel initially faults
> with its pages local, drops to 80% locality shortly after starting, and
> then climbs back up again as pages get migrated. Initially, the number
> of hints the baseline kernel traps is extremely high and it drops as
> pages migrate.
>
> Most other workloads were almost neutral, with the impact of the series
> more obvious in some than in others. is.C is really short-lived, for
> example, yet locality of faults went from 43% to 95%.
>
> sp.C showed by far the most obvious impact:
>
>                           5.6.0-rc3             5.6.0-rc3
>                             vanilla    schedcore-20200227
> Min       sp.C    141.52 (   0.00%)    173.61 ( -22.68%)
> Amean     sp.C    141.87 (   0.00%)    174.00 * -22.65%*
> Stddev    sp.C      0.26 (   0.00%)      0.25 (   5.06%)
> CoeffVar  sp.C      0.18 (   0.00%)      0.14 (  22.59%)
> Max       sp.C    142.10 (   0.00%)    174.25 ( -22.62%)
> BAmean-50 sp.C    141.59 (   0.00%)    173.79 ( -22.74%)
> BAmean-95 sp.C    141.81 (   0.00%)    173.93 ( -22.65%)
> BAmean-99 sp.C    141.81 (   0.00%)    173.93 ( -22.65%)
>
> That's a big hit in terms of performance, even if it is less
> variable. Looking at the NUMA stats:
>
> Ops NUMA alloc hit                  3100836.00      2161667.00
> Ops NUMA alloc miss                       0.00            0.00
> Ops NUMA interleave hit                   0.00            0.00
> Ops NUMA alloc local                 915700.00        98531.00
> Ops NUMA base-page range updates   12178032.00     13483382.00
> Ops NUMA PTE updates               11809904.00     12792182.00
> Ops NUMA PMD updates                    719.00         1350.00
> Ops NUMA hint faults               11791912.00     12782512.00
> Ops NUMA hint local faults %        9345987.00     11467427.00
> Ops NUMA hint local percent              79.26           89.71
> Ops NUMA pages migrated              871805.00        21505.00
> Ops AutoNUMA cost                     59061.37        64007.35
>
> Note the locality -- 79.26% versus 89.71% -- but the vanilla kernel
> migrated 871K pages while the new kernel migrated 21K. Looking at
> migrations over time, I can see that the vanilla kernel migrates 180K
> pages in the first 10 seconds of each iteration while tip/sched/core
> migrated so few that they are not even clear on the graph. The workload
> is long-lived enough that the disruption from those early migrations
> becomes less visible over the course of a run.
>
> The problem is that, as far as I can tell, the kernel measures nothing
> that uniquely identifies SP as a workload that should spread wide and
> migrate its pages early, as opposed to processes that are less memory
> bound or that communicate heavily. The state is simply not maintained
> and it cannot be inferred from the runqueue or task state. From both a
> locality point of view and in terms of available CPUs, leaving SP alone
> makes sense, but we do not detect that memory bandwidth is the issue. In
> other cases, the cost of migrations alone would damage performance; SP
> is an exception as it's long-lived enough to benefit once the first few
> seconds have passed.
>
> I experimented with a few different approaches, but without being able
> to detect the bandwidth it was a case of SP improving while almost
> everything else suffered. For example, SP on 2-socket machines with
> enough memory bandwidth degrades when spread too quickly, so with
> tip/sched/core SP either benefits or suffers depending on the machine.
> Basic communicating tasks degrade 4-8%, depending on the machine and
> exact workload, when moving back to the vanilla kernel, and that is
> fairly universal as far as I can see.
>
> So I think that the new behaviour is generally more sane -- do not have
> memory and CPU balancing fight excessively -- but if there are
> suggestions on how to distinguish tasks that should spread wide and
> evenly regardless of initial memory locality, then I'm all ears.
> I do not think that migrating like crazy in the hope that it happens to
> work out, with CPU load balancing and NUMA balancing using very
> different criteria for evaluation, is a better approach.
>
> --
> Mel Gorman
> SUSE Labs
>


--
-Jirka