Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Mel Gorman
Date: Thu Mar 12 2020 - 11:56:48 EST


On Thu, Mar 12, 2020 at 01:10:36PM +0100, Jirka Hladky wrote:
> Hi Mel,
>
> thanks a lot for analyzing it!
>
> My big concern is that the performance drop for low threads counts (roughly
> up to 2x number of NUMA nodes) is not just a rare corner case, but it might
> be more common.

That is hard to tell. In my case I was seeing the problem running a HPC
workload, dedicated to that machine and using only 6% of available CPUs. I
find it unlikely that is common because who acquires such a large machine
and then uses a tiny percentage of it. Rember also the other points I made
such as 1M migrations happening in the first few seconds just trying to
deal with the load balancer and NUMA balancer fighting each other. While
that might happen to be good for SP, it's not universally good behaviour
and I've dealt with issues in the past whereby the NUMA balancer would
get confused and just ramp up the frequency it samples and migrates trying
to override the load balancer.

> We see the drop for the following benchmarks/tests,
> especially on 8 NUMA nodes servers. However, four and even 2 NUMA node
> servers are affected as well.
>
> Numbers show average performance drop (median of runtime collected from 5
> subsequential runs) compared to vanilla kernel.
>
> 2x AMD 7351 (EPYC Naples), 8 NUMA nodes
> ===================================
> NAS: sp_C test: -50%, peak perf. drop with 8 threads

I hadn't tested 8 threads specifically I think that works out as
using 12.5% of the available machine. The allowed imbalance between
nodes means that some SP instances will stack on the same node but not
the same CPU.

> NAS: mg_D: -10% with 16 threads

While I do not have the most up to date figures available, I found the
opposite trend at 21 threads (the test case I used were 1/3rd available
CPUs and as close to maximum CPUs as possible). There I found it was 10%
faster for an 8 node machine.

For 4 nodes, using a single JVM was performance neutral *on average* but
much less variable. With one JVM per node, there was a mix of small <2%
regressions for some thread counts and up to 9% faster on others.

> SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
> SPECjvm2008: compress and cr_signverify tests: -10%(peak drop with 8
> threads)

I didn't run SPECjvm for multiple thread sizes so I don't have data yet
and may not have for another day at least.

> SPECjbb2005: -10% for 16 threads
>

I found this to depend in the number of JVMs used and the thread count.
Slower at low thread counts, faster at higher thread counts but with
more stable results with the series applied and less NUMA balancer
activity.

This is somewhat of a dilemma. Without the series, the load balancer and
NUMA balancer use very different criteria on what should happen and
results are not stable. In some cases acting randomly happens to work
out and in others it does not and overall it depends on the workload and
machine. It's very highly dependent on both the workload and the machine
and it's a question if we want to continue dealing with two parts of the
scheduler disagreeing on what criteria to use or try to improve the
reconciled load and memory balancer sharing similar logic.

In *general*, I found that the series won a lot more than it lost across
a spread of workloads and machines but unfortunately it's also an area
where counter-examples can be found.

--
Mel Gorman
SUSE Labs