Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization

From: K Prateek Nayak
Date: Mon Jul 10 2023 - 07:07:46 EST


Hello Chenyu,

Thank you for sharing this extended version. Sharing the results from
testing below.

tl;dr

- tbench, netperf and unixbench-spawn see an improvement with ILB_UTIL.

- schbench (old) sees a regression in tail latency once system is heavily
loaded. DeathStarBench and SPECjbb too see a small drop under those
conditions.

- Rest of the benchmark results do not vary much.

On 6/21/2023 12:59 PM, Chen Yu wrote:
> Hi Gautham,
> On 2023-06-15 at 11:31:07 +0530, Gautham R. Shenoy wrote:
>> Hello Chen Yu,
>>
>>
>> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
>>> When CPU is about to enter idle, it invokes newidle_balance() to pull
>>> some tasks from other runqueues. Although there is per domain
>>> max_newidle_lb_cost to throttle the newidle_balance(), it would be
>>> good to further limit the scan based on overall system utilization.
>>> The reason is that there is no limitation for newidle_balance() to
>>> launch this balance simultaneously on multiple CPUs. Since each
>>> newidle_balance() has to traverse all the CPUs to calculate the
>>> statistics one by one, this total time cost on newidle_balance()
>>> could be O(n^2). This is not good for performance or power saving.
>>>
>>> For example, sqlite has spent quite some time on newidle balance()
>>> on Intel Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
>>> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
>>> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>>>
>>> Based on this observation, limit the scan depth of newidle_balance()
>>> by considering the utilization of the LLC domain. Let the number of
>>> scanned groups be a linear function of the utilization ratio:
>>>
>>
>> Is there any particular reason why this is being limited only to the
>> LLC domain ?
>>
>> On architectures where the LLC domain may not be so large (POWER9/10,
>> AMD), the additional cost is usually paid at the higher domains where
>> the number of groups is greater / equal to the number of groups in the
>> LLC domain and where sd_span is pretty large. It would be good to
>> explore avoiding the scan cost on those domains as well, right?
>>
>>> nr_groups_to_scan = nr_groups * (1 - util_ratio)
>>
>> If we can extend this logic to higher domains, on a Zen3 1 Socket
>> server with 128 CPUs at the DIE domain containing 8 groups, we can
>> expect a significant reduction in the time spent doing
>> update_sg_lb_stats() at higher utilizations.
>>
>> util_ratio nr_groups_to_scan nr_cpus_scanned
>> ========================================================
>> 0.9 1 16 (-87.5%)
>> 0.75 2 32 (-75%)
>> 0.5 4 64 (-50%)
>> 0.25 6 96 (-25%)
>> 0.1 7 112 (-12.5%)
>>
>>
>> On a Zen 4 1 socket server with 192 CPUs at the DIE domain containing
>> 12 groups, values will be:
>>
>> util_ratio nr_groups_to_scan nr_cpus_scanned
>> ========================================================
>> 0.9 1 16 (-91%)
>> 0.75 3 48 (-75%)
>> 0.5 6 96 (-50%)
>> 0.25 9 144 (-25%)
>> 0.1 10 160 (-16.7%)
>>
> I have an idea to limit scan depth for newidle balance for big domains.
> These domains should have CPUs higher than/equals to LLC(MC domain).
> However it seems that in current kernel only domain with SD_SHARE_PKG_RESOURCES
> flag set will have the shared struct sched_domain_shared among the CPUs in this
> domain. And this is reasonable because the cost to access the struct sched_domain_shared
> is lower if the CPUs share cache. Since ILB_UTIL relies on the sched_domain_shared
> to get the scan depth, I removed the restriction of SD_SHARE_PKG_RESOURCES
> during sched_domain_shared assignment.
> If non-LLC domain's sched_domain_shared is only used for ILB_UTIL,
> the overhead should be not too high(only periodic load balance would
> write to sched_domain_shared). Here is a untest patch which shows what
> I'm thinking of, and I'll do some refinement based on this:
>
> thanks,
> Chenyu
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 67b573d5bf28..ce7ffbb7b3f8 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -82,6 +82,10 @@ struct sched_domain_shared {
> atomic_t nr_busy_cpus;
> int has_idle_cores;
> int nr_idle_scan;
> + /* ilb scan depth and load balance statistic snapshot */
> + int ilb_nr_scan;
> + unsigned long ilb_total_load;
> + unsigned long ilb_total_capacity;
> };
>
> struct sched_domain {
> @@ -152,6 +156,7 @@ struct sched_domain {
> struct sched_domain_shared *shared;
>
> unsigned int span_weight;
> + unsigned int nr_groups;
> /*
> * Span of all CPUs in this domain.
> *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d724215826ae..34619dbb2f4e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10162,6 +10162,54 @@ static void update_idle_cpu_scan(struct lb_env *env,
> WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
> }
>
> +/*
> + * Get the domain shared information of dst CPU.
> + */
> +static struct sched_domain_shared *get_sd_shared(struct lb_env *env)
> +{
> + /*
> + * Do not consider the domains smaller than LLC because those
> + * small domains have low cost on idle load balance.
> + */
> + if (env->sd->span_weight < per_cpu(sd_llc_size, env->dst_cpu))
> + return NULL;
> +
> + return env->sd->shared;
> +}
> +
> +static void update_ilb_group_scan(struct lb_env *env,
> + unsigned long sum_util,
> + struct sched_domain_shared *sd_share,
> + struct sd_lb_stats *sds)
> +{
> + u64 tmp, nr_scan;
> +
> + if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
> + return;
> +
> + if(!sd_share)
> + return;
> + /*
> + * Limit the newidle balance scan depth based on overall system
> + * utilization:
> + * nr_groups_scan = nr_groups * (1 - util_ratio)
> + * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
> + */
> + nr_scan = env->sd->nr_groups * sum_util;
> + tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
> + do_div(nr_scan, tmp);
> + nr_scan = env->sd->nr_groups - nr_scan;
> + if ((int)nr_scan != sd_share->ilb_nr_scan)
> + WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
> +
> + /* save the statistic snapshot of the periodic load balance */
> + if (sds->total_load != sd_share->ilb_total_load)
> + WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
> +
> + if (sds->total_capacity != sd_share->ilb_total_capacity)
> + WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
> +}
> +
> /**
> * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -10170,11 +10218,17 @@ static void update_idle_cpu_scan(struct lb_env *env,
>
> static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
> {
> + struct sched_domain_shared *sd_share = get_sd_shared(env);
> struct sched_group *sg = env->sd->groups;
> struct sg_lb_stats *local = &sds->local_stat;
> struct sg_lb_stats tmp_sgs;
> unsigned long sum_util = 0;
> - int sg_status = 0;
> + int sg_status = 0, nr_scan_ilb;
> + bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
> + sd_share && READ_ONCE(sd_share->ilb_total_capacity);
> +
> + if (ilb_util_enabled)
> + nr_scan_ilb = sd_share->ilb_nr_scan;
>
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -10192,6 +10246,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
>
> + if (ilb_util_enabled && --nr_scan_ilb <= 0) {
> + /*
> + * Borrow the statistic of previous periodic load balance.
> + * The data might be stale and it is a trade-off.
> + */
> + sds->total_load = READ_ONCE(sd_share->ilb_total_load);
> + sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
> +
> + break;
> + }
> +
> if (local_group)
> goto next_group;
>
> @@ -10239,6 +10304,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> }
>
> update_idle_cpu_scan(env, sum_util);
> + update_ilb_group_scan(env, sum_util, sd_share, sds);
> }
>
> /**
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..8f6e5b08408d 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
>
> SCHED_FEAT(RT_RUNTIME_SHARE, false)
> SCHED_FEAT(LB_MIN, false)
> +SCHED_FEAT(ILB_UTIL, true)
> SCHED_FEAT(ATTACH_AGE_LOAD, true)
>
> SCHED_FEAT(WA_IDLE, true)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..98bfac5f7836 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> struct cpumask *covered = sched_domains_tmpmask;
> struct sd_data *sdd = sd->private;
> struct sched_domain *sibling;
> - int i;
> + int i, nr_groups = 0;
>
> cpumask_clear(covered);
>
> @@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> if (!sg)
> goto fail;
>
> + nr_groups++;
> +
> sg_span = sched_group_span(sg);
> cpumask_or(covered, covered, sg_span);
>
> @@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
> last->next = first;
> }
> sd->groups = first;
> + sd->nr_groups = nr_groups;
>
> return 0;
>
> @@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> struct sd_data *sdd = sd->private;
> const struct cpumask *span = sched_domain_span(sd);
> struct cpumask *covered;
> - int i;
> + int i, nr_groups = 0;
>
> lockdep_assert_held(&sched_domains_mutex);
> covered = sched_domains_tmpmask;
> @@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
>
> sg = get_group(i, sdd);
>
> + nr_groups++;
> +
> cpumask_or(covered, covered, sched_group_span(sg));
>
> if (!first)
> @@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
> }
> last->next = first;
> sd->groups = first;
> + sd->nr_groups = nr_groups;
>
> return 0;
> }
> @@ -1641,14 +1647,12 @@ sd_init(struct sched_domain_topology_level *tl,
> }
>
> /*
> - * For all levels sharing cache; connect a sched_domain_shared
> + * For all levels; connect a sched_domain_shared
> * instance.
> */
> - if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> - sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> - atomic_inc(&sd->shared->ref);
> - atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> - }
> + sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> + atomic_inc(&sd->shared->ref);
> + atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
>
> sd->private = sdd;
>

o System Details

Dual Socket 3rd Generation EPYC System (2 x 64C/128T)

o NPS Modes

NPS Modes are used to logically divide single socket into
multiple NUMA region.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
Total 2 NUMA nodes in the dual socket machine.

Node 0: 0-63, 128-191
Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
Total 4 NUMA nodes exist over 2 socket.

Node 0: 0-31, 128-159
Node 1: 32-63, 160-191
Node 2: 64-95, 192-223
Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
Total 8 NUMA nodes exist over 2 socket.

Node 0: 0-15, 128-143
Node 1: 16-31, 144-159
Node 2: 32-47, 160-175
Node 3: 48-63, 176-191
Node 4: 64-79, 192-207
Node 5: 80-95, 208-223
Node 6: 96-111, 223-231
Node 7: 112-127, 232-255

o Kernel Versions

- tip - tip:sched/core at commit e2a1f85bf9f5 "sched/psi:
Avoid resetting the min update period when it is
unnecessary")

- ILB_UTIL - tip:sched/core + this patch

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test: tip ILB_UTIL
1-groups: 3.92 (0.00 pct) 3.66 (6.63 pct)
2-groups: 4.58 (0.00 pct) 4.18 (8.73 pct)
4-groups: 4.99 (0.00 pct) 4.46 (10.62 pct)
8-groups: 5.67 (0.00 pct) 5.39 (4.93 pct)
16-groups: 7.88 (0.00 pct) 10.43 (-32.36 pct)

o NPS2

Test: tip ILB_UTIL
1-groups: 3.82 (0.00 pct) 3.59 (6.02 pct)
2-groups: 4.40 (0.00 pct) 4.08 (7.27 pct)
4-groups: 4.84 (0.00 pct) 4.44 (8.26 pct)
8-groups: 5.45 (0.00 pct) 6.32 (-15.96 pct)
16-groups: 6.94 (0.00 pct) 11.71 (-68.73 pct)

o NPS4

Test: tip ILB_UTIL
1-groups: 3.82 (0.00 pct) 3.65 (4.45 pct)
2-groups: 4.44 (0.00 pct) 4.19 (5.63 pct)
4-groups: 4.86 (0.00 pct) 4.60 (5.34 pct)
8-groups: 5.42 (0.00 pct) 5.36 (1.10 pct)
16-groups: 6.68 (0.00 pct) 10.09 (-51.04 pct)

~~~~~~~~~~~~~~~~~~
~ schbench (Old) ~
~~~~~~~~~~~~~~~~~~

o NPS1

#workers: tip ILB_UTIL
1: 26.00 (0.00 pct) 26.00 (0.00 pct)
2: 27.00 (0.00 pct) 28.00 (-3.70 pct)
4: 31.00 (0.00 pct) 27.00 (12.90 pct)
8: 36.00 (0.00 pct) 40.00 (-11.11 pct)
16: 49.00 (0.00 pct) 50.00 (-2.04 pct)
32: 80.00 (0.00 pct) 80.00 (0.00 pct)
64: 169.00 (0.00 pct) 170.00 (-0.59 pct)
128: 343.00 (0.00 pct) 338.00 (1.45 pct)
256: 42048.00 (0.00 pct) 45760.00 (-8.82 pct)
512: 95104.00 (0.00 pct) 109696.00 (-15.34 pct)

o NPS2

#workers: tip ILB_UTIL
1: 23.00 (0.00 pct) 21.00 (8.69 pct)
2: 24.00 (0.00 pct) 25.00 (-4.16 pct)
4: 31.00 (0.00 pct) 29.00 (6.45 pct)
8: 41.00 (0.00 pct) 43.00 (-4.87 pct)
16: 48.00 (0.00 pct) 50.00 (-4.16 pct)
32: 81.00 (0.00 pct) 81.00 (0.00 pct)
64: 157.00 (0.00 pct) 180.00 (-14.64 pct)
128: 386.00 (0.00 pct) 385.00 (0.25 pct)
256: 48832.00 (0.00 pct) 52032.00 (-6.55 pct)
512: 92032.00 (0.00 pct) 113024.00 (-22.80 pct)

o NPS4

#workers: tip ILB_UTIL
1: 21.00 (0.00 pct) 23.00 (-9.52 pct)
2: 28.00 (0.00 pct) 30.00 (-7.14 pct)
4: 32.00 (0.00 pct) 33.00 (-3.12 pct)
8: 46.00 (0.00 pct) 51.00 (-10.86 pct)
16: 51.00 (0.00 pct) 54.00 (-5.88 pct)
32: 82.00 (0.00 pct) 88.00 (-7.31 pct)
64: 173.00 (0.00 pct) 175.00 (-1.15 pct)
128: 396.00 (0.00 pct) 387.00 (2.27 pct)
256: 48832.00 (0.00 pct) 46912.00 (3.93 pct)
512: 95104.00 (0.00 pct) 110720.00 (-16.41 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients: tip ILB_UTIL
1 452.49 (0.00 pct) 449.93 (-0.56 pct)
2 862.44 (0.00 pct) 875.04 (1.46 pct)
4 1604.27 (0.00 pct) 1626.23 (1.36 pct)
8 2966.77 (0.00 pct) 3036.80 (2.36 pct)
16 5176.70 (0.00 pct) 5402.88 (4.36 pct)
32 8205.24 (0.00 pct) 9256.48 (12.81 pct)
64 13956.71 (0.00 pct) 15581.58 (11.64 pct)
128 24005.50 (0.00 pct) 24782.63 (3.23 pct)
256 32457.61 (0.00 pct) 30810.66 (-5.07 pct)
512 34345.24 (0.00 pct) 40971.90 (19.29 pct)
1024 33432.92 (0.00 pct) 41604.06 (24.44 pct)

o NPS2

Clients: tip ILB_UTIL
1 453.73 (0.00 pct) 444.72 (-1.98 pct)
2 861.71 (0.00 pct) 853.67 (-0.93 pct)
4 1599.14 (0.00 pct) 1573.69 (-1.59 pct)
8 2951.03 (0.00 pct) 3021.87 (2.40 pct)
16 5080.32 (0.00 pct) 5464.64 (7.56 pct)
32 7900.41 (0.00 pct) 10304.44 (30.42 pct)
64 14629.65 (0.00 pct) 17083.33 (16.77 pct)
128 23155.88 (0.00 pct) 25278.86 (9.16 pct)
256 33449.57 (0.00 pct) 32964.11 (-1.45 pct)
512 33757.47 (0.00 pct) 40951.04 (21.30 pct)
1024 34823.14 (0.00 pct) 41737.76 (19.85 pct)

o NPS4

Clients: tip ILB_UTIL
1 450.14 (0.00 pct) 451.88 (0.38 pct)
2 863.26 (0.00 pct) 864.96 (0.19 pct)
4 1618.71 (0.00 pct) 1632.00 (0.82 pct)
8 2929.35 (0.00 pct) 3071.80 (4.86 pct)
16 5114.04 (0.00 pct) 5373.74 (5.07 pct)
32 7912.18 (0.00 pct) 8830.49 (11.60 pct)
64 14424.72 (0.00 pct) 15598.13 (8.13 pct)
128 23614.97 (0.00 pct) 24563.76 (4.01 pct)
256 34365.13 (0.00 pct) 32096.70 (-6.60 pct)
512 34215.50 (0.00 pct) 42068.49 (22.95 pct)
1024 35421.90 (0.00 pct) 42230.56 (19.22 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

- 10 Runs:

Test: tip ILB_UTIL
Copy: 271317.35 (0.00 pct) 304210.62 (12.12 pct)
Scale: 205533.77 (0.00 pct) 204155.75 (-0.67 pct)
Add: 221624.62 (0.00 pct) 228757.07 (3.21 pct)
Triad: 228500.68 (0.00 pct) 236454.48 (3.48 pct)

- 100 Runs:

Test: tip ILB_UTIL
Copy: 317381.65 (0.00 pct) 321587.90 (1.32 pct)
Scale: 214145.00 (0.00 pct) 211397.70 (-1.28 pct)
Add: 239243.29 (0.00 pct) 235497.67 (-1.56 pct)
Triad: 249477.76 (0.00 pct) 240764.14 (-3.49 pct)

o NPS2

- 10 Runs:

Test: tip ILB_UTIL
Copy: 277761.29 (0.00 pct) 279582.97 (0.65 pct)
Scale: 215193.83 (0.00 pct) 203628.71 (-5.37 pct)
Add: 242725.75 (0.00 pct) 232522.80 (-4.20 pct)
Triad: 237253.44 (0.00 pct) 245716.42 (3.56 pct)

- 100 Runs:

Test: tip ILB_UTIL
Copy: 318082.10 (0.00 pct) 320640.80 (0.80 pct)
Scale: 219338.56 (0.00 pct) 222158.47 (1.28 pct)
Add: 248118.20 (0.00 pct) 254163.15 (2.43 pct)
Triad: 247088.55 (0.00 pct) 252459.53 (2.17 pct)

o NPS4

- 10 Runs:

Test: tip ILB_UTIL
Copy: 273307.14 (0.00 pct) 269979.40 (-1.21 pct)
Scale: 235715.23 (0.00 pct) 225429.20 (-4.36 pct)
Add: 244500.40 (0.00 pct) 227988.81 (-6.75 pct)
Triad: 250600.04 (0.00 pct) 234012.67 (-6.61 pct)

- 100 Runs:

Test: tip ILB_UTIL
Copy: 345396.19 (0.00 pct) 335548.25 (-2.85 pct)
Scale: 241521.63 (0.00 pct) 228991.04 (-5.18 pct)
Add: 261157.86 (0.00 pct) 247020.34 (-5.41 pct)
Triad: 267804.99 (0.00 pct) 258260.01 (-3.56 pct)

~~~~~~~~~~~
~ netperf ~
~~~~~~~~~~~

o NPS1

Test: tip ILB_UTIL
1-clients: 102839.97 (0.00 pct) 101826.77 (-0.98 pct)
2-clients: 98428.08 (0.00 pct) 98563.25 (0.13 pct)
4-clients: 92298.45 (0.00 pct) 95310.26 (3.26 pct)
8-clients: 85618.41 (0.00 pct) 87859.85 (2.61 pct)
16-clients: 78722.18 (0.00 pct) 79430.42 (0.89 pct)
32-clients: 73610.75 (0.00 pct) 76459.08 (3.86 pct)
64-clients: 55285.07 (0.00 pct) 64071.43 (15.89 pct)
128-clients: 31176.92 (0.00 pct) 37287.20 (19.59 pct)
256-clients: 20011.44 (0.00 pct) 31243.73 (56.12 pct)

o NPS2

Test: tip ILB_UTIL
1-clients: 103105.55 (0.00 pct) 99162.95 (-3.82 pct)
2-clients: 98720.29 (0.00 pct) 96055.84 (-2.69 pct)
4-clients: 92289.39 (0.00 pct) 92818.61 (0.57 pct)
8-clients: 84998.63 (0.00 pct) 86693.17 (1.99 pct)
16-clients: 76395.81 (0.00 pct) 77137.01 (0.97 pct)
32-clients: 71110.89 (0.00 pct) 70154.80 (-1.34 pct)
64-clients: 49526.21 (0.00 pct) 55032.79 (11.11 pct)
128-clients: 27917.51 (0.00 pct) 36377.03 (30.30 pct)
256-clients: 20067.17 (0.00 pct) 27607.78 (37.57 pct)

o NPS4

Test: tip ILB_UTIL
1-clients: 102139.49 (0.00 pct) 103414.93 (1.24 pct)
2-clients: 98259.53 (0.00 pct) 101472.40 (3.26 pct)
4-clients: 91576.79 (0.00 pct) 96917.69 (5.83 pct)
8-clients: 84742.30 (0.00 pct) 90389.72 (6.66 pct)
16-clients: 79540.75 (0.00 pct) 85183.23 (7.09 pct)
32-clients: 71166.14 (0.00 pct) 78511.48 (10.32 pct)
64-clients: 51763.24 (0.00 pct) 61334.30 (18.49 pct)
128-clients: 27829.29 (0.00 pct) 35989.34 (29.32 pct)
256-clients: 24185.37 (0.00 pct) 35769.17 (47.89 pct)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

tip ILB_UTIL
Hmean unixbench-dhry2reg-1 41322625.19 ( 0.00%) 41202944.91 ( -0.29%)
Hmean unixbench-dhry2reg-512 6252491108.60 ( 0.00%) 6193511930.01 * -0.94%*
Amean unixbench-syscall-1 2501398.27 ( 0.00%) 2558258.57 * -2.27%*
Amean unixbench-syscall-512 8120524.00 ( 0.00%) 8014692.00 * 1.30%*
Hmean unixbench-pipe-1 2359346.02 ( 0.00%) 2395716.82 * 1.54%*
Hmean unixbench-pipe-512 338790322.61 ( 0.00%) 339462110.52 ( 0.20%)
Hmean unixbench-spawn-1 4261.52 ( 0.00%) 4786.09 * 12.31%*
Hmean unixbench-spawn-512 64328.93 ( 0.00%) 68328.36 * 6.22%*
Hmean unixbench-execl-1 3677.73 ( 0.00%) 3671.96 ( -0.16%)
Hmean unixbench-execl-512 11984.83 ( 0.00%) 13272.01 ( 10.74%)

o NPS2

tip ILB_UTIL
Hmean unixbench-dhry2reg-1 41311787.29 ( 0.00%) 41209738.92 ( -0.25%)
Hmean unixbench-dhry2reg-512 6243873272.76 ( 0.00%) 6198007442.15 * -0.73%*
Amean unixbench-syscall-1 2503190.70 ( 0.00%) 2559295.30 * -2.24%*
Amean unixbench-syscall-512 8012388.13 ( 0.00%) 7984268.83 * 0.35%*
Hmean unixbench-pipe-1 2340486.25 ( 0.00%) 2395174.42 * 2.34%*
Hmean unixbench-pipe-512 338965319.79 ( 0.00%) 339972146.39 ( 0.30%)
Hmean unixbench-spawn-1 5241.83 ( 0.00%) 5041.98 * -3.81%*
Hmean unixbench-spawn-512 65799.86 ( 0.00%) 68871.88 * 4.67%*
Hmean unixbench-execl-1 3670.65 ( 0.00%) 3659.10 ( -0.31%)
Hmean unixbench-execl-512 13682.00 ( 0.00%) 13984.58 ( 2.21%)

o NPS4

tip ILB_UTIL
Hmean unixbench-dhry2reg-1 41025577.99 ( 0.00%) 41039940.89 ( 0.04%)
Hmean unixbench-dhry2reg-512 6255568261.91 ( 0.00%) 6216198481.97 * -0.63%*
Amean unixbench-syscall-1 2507165.37 ( 0.00%) 2553468.33 * -1.85%*
Amean unixbench-syscall-512 7458476.50 ( 0.00%) 7483366.27 * -0.33%*
Hmean unixbench-pipe-1 2369301.21 ( 0.00%) 2397653.84 * 1.20%*
Hmean unixbench-pipe-512 340299405.72 ( 0.00%) 340332182.64 ( 0.01%)
Hmean unixbench-spawn-1 5571.78 ( 0.00%) 5389.50 ( -3.27%)
Hmean unixbench-spawn-512 63999.96 ( 0.00%) 68343.41 * 6.79%*
Hmean unixbench-execl-1 3587.15 ( 0.00%) 3628.48 * 1.15%*
Hmean unixbench-execl-512 14184.17 ( 0.00%) 13720.55 ( -3.27%)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

base: 298681.00 (var: 2.31%)
ILB_UTIL 292352.67 (var: 3.31%) (-2.11%)

o NPS2:

base: 296570.00 (var: 1.01%)
ILB_UTIL 298804.67 (var: 1.50%) (0.75%)

o NPS4:

base 297181.67 (var: 0.46%)
ILB_UTIL 297495.00 (var: 0.33%) (0.10%)

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1:

- 1 CCD

base: 1.00 (var: 0.27%)
ILB_UTIL: 1.03 (var: 0.16%) (+3.31%)

- 2 CCD

base: 1.00 (var: 0.42%)
ILB_UTIL: 1.01 (var: 0.19%) (+1.48%)

- 4 CCD

base: 1.00 (var: 0.46%)
ILB_UTIL: 0.98 (var: 0.17%) (-2.00%)

- 8 CCD

base: 1.00 (var: 0.63%)
ILB_UTIL: 0.96 (var: 0.46%) (-3.79%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SPECjbb2015 - multi-JVM ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~

max-jOPS 1.00 0.99 (-1.11%)
critical-jOPS 1.00 0.99 (-1.06%)

--

I have a couple of theories:

o Either new_idle_balance is failing to find an overloaded busy rq as a
result of the limit.

o Or, there is a chain reaction where pulling from a loaded rq which is not
the most loaded, will lead to more new_idle_balancing attempts which is
degrading performance.

I'll go back and get some data to narrow down the cause. Meanwhile if
there is any specific benchmark you would like me to run on the test
system, please do let me know.

--
Thanks and Regards,
Prateek