Re: [PATCH] powerpc/paravirt: Improve vcpu_is_preempted

From: Aboorva Devarajan
Date: Tue Oct 17 2023 - 08:19:45 EST


On Mon, 2023-10-09 at 10:47 +0530, Srikar Dronamraju wrote:

Hi Srikar,

I benchmarked this patch on a bare-metal POWER9 node by launching KVM
guests to observe the improvements achieved in KVM with the patched
kernel. The schbench latency comparison is below.

The system was running in SMT4 mode with the following configuration:

Setup:

$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 8
Model: 2.3 (pvr 004e 1203)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2300.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
NUMA node250 CPU(s):
NUMA node251 CPU(s):
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):

- Baseline kernel : v6.6.0-rc5
- Patched kernel : v6.6.0-rc5 with patchset
- schbench version : up to commit 2eef44 (schbench: record the
  execution time in the matrix multiplication mode)

Results:
--------

These results show the schbench latency on a patched kernel compared
to a baseline kernel on KVM. The numbers in the "compare%" column
represent the percentage difference between the latency measured on the
baseline kernel and the patched kernel. A negative percentage means the
patched kernel performs worse (higher latency) than the baseline, while
a positive percentage means it performs better (lower latency).
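
The exact formula behind "compare%" is not spelled out above. As a
minimal sketch, assuming the difference is normalized by the patched
latency (a difference normalized by the baseline latency could never
exceed +100%, yet values such as +512.08 appear below), it could be
computed as follows. The helper name and its parameters are
hypothetical:

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from schbench or from the scripts used
 * for this report. Assumes the difference is normalized by the patched
 * latency. A positive result means the patched kernel had lower latency.
 */
static double compare_pct(double baseline_usec, double patched_usec)
{
	assert(patched_usec > 0.0);	/* guard against division by zero */
	return (baseline_usec - patched_usec) / patched_usec * 100.0;
}
```

Under this assumption, a baseline latency twice the patched latency
gives +100, and a patched latency twice the baseline gives -50.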

Scenarios:
----------

Case 1: No Noise

Host: Idle
KVM 1: Launched a KVM affined to 0-39 CPUs (40 CPUs)
KVM 1 (Workload) : ./schbench -m 20 -t 2 -r 30 (benchmark)

schbench latency (niter: 20)

percentile compare% (avg)
(higher the better)

50.0th: -4.84
75.0th: -8.09
90.0th: -3.39
95.0th: +5.16
99.0th: +90.78
99.5th: +36.34
99.9th: +8.31

--------------------------------------------------------

Case 2: With Noise: Over-commit case: Multiple KVM guests sharing the
same set of CPUs

Two KVM instances are launched: one is benchmarked while the other
runs a workload to introduce noise.

KVM 1: Launched a KVM affined to 0-39 CPUs (40 CPUs)
KVM 1 (Workload) : ./schbench -m 20 -t 2 -r 30 (benchmark)

KVM 2 (Noise): Launched a KVM affined to 0-39 CPUs

schbench latency (niter: 20)

percentile compare% (avg)
(higher the better)

50.0th: -1.47
75.0th: -5.72
90.0th: +7.88
95.0th: +10.71
99.0th: +512.08
99.5th: +380.61
99.9th: +90.39

--------------------------------------------------------

Case 3: Overlap case: Multiple KVM guests sharing a subset of CPUs.

Two KVM instances are launched: one is benchmarked while the other
runs a workload to introduce noise.

KVM 1: Launched a KVM affined to 0-39 CPUs (40 CPUs)
KVM 1 (Workload) : ./schbench -m 20 -t 2 -r 30 (benchmark)

KVM 2 (Noise): Launched a KVM affined to 0-19 CPUs

schbench latency (niter: 20)

percentile compare% (avg)
(higher the better)

50.0th: -1.63
75.0th: -2.78
90.0th: +57.62
95.0th: +87.90
99.0th: +343.66
99.5th: +178.01
99.9th: +36.07

--------------------------------------------------------

The above results demonstrate the effectiveness of the proposed
approach, which utilizes the idle-hint in lppaca to detect the
preempted vCPU more efficiently. This approach is beneficial for
improving schbench latency on KVM, particularly the tail latencies.
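
For reference, the decision order in the patched vcpu_is_preempted()
can be modelled in userspace roughly as below. This is only a sketch:
struct vcpu_state, model_vcpu_is_preempted() and THREADS_PER_CORE are
hypothetical stand-ins for the lppaca fields and kernel helpers, and the
is_shared_processor() check and the "target is on the caller's own core"
shortcut from the patch are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define THREADS_PER_CORE 4	/* SMT4, as in the setup above (assumption) */

/* Hypothetical stand-in for the per-vCPU fields read from the lppaca. */
struct vcpu_state {
	uint32_t yield_count;	/* odd value: vCPU is not dispatched */
	bool idle;		/* OS has requested idle */
};

/*
 * Model of the check order in the patch. 'v' is indexed by CPU number and
 * must cover every thread of the core that 'cpu' belongs to.
 */
static bool model_vcpu_is_preempted(const struct vcpu_state *v, int cpu,
				    bool kvm_guest)
{
	int first, i;

	/* Even yield count: the vCPU is running, so it is not preempted. */
	if (!(v[cpu].yield_count & 1))
		return false;

	/* Yielded without the OS asking for idle: treated as preempted. */
	if (!v[cpu].idle)
		return true;

	/* The sibling scan only applies outside KVM guests in the patch. */
	if (!kvm_guest) {
		first = cpu - (cpu % THREADS_PER_CORE);
		for (i = first; i < first + THREADS_PER_CORE; i++) {
			if (i == cpu)
				continue;
			if (!(v[i].yield_count & 1))
				return false;	/* a sibling is running */
			if (!v[i].idle)
				return true;	/* a sibling was preempted */
		}
	}

	/* Every thread examined is idle by request: assume not preempted. */
	return false;
}
```

In a KVM guest only the first two checks matter, so an idle vCPU with an
odd yield count is no longer reported as preempted.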

Thanks,
Aboorva

> PowerVM Hypervisor dispatches on a whole core basis. In a shared
> LPAR, a CPU from a core that is preempted may have a larger latency.
> In such a scenario, its preferable to choose a different CPU to run.
>
> If one of the CPUs in the core is active, i.e neither CEDED nor
> preempted, then consider this CPU as not preempted.
>
> Also if any of the CPUs in the core has yielded but OS has not
> requested CEDE or CONFER, then consider this CPU to be preempted.
>
> Cc: Ajay Kaher <akaher@xxxxxxxxxx>
> Cc: Alexey Makhalov <amakhalov@xxxxxxxxxx>
> Cc: Christophe Leroy <christophe.leroy@xxxxxxxxxx>
> Cc: Juergen Gross <jgross@xxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
> Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
> Cc: Nicholas Piggin <npiggin@xxxxxxxxx>
> Cc: virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> ---
> arch/powerpc/include/asm/paravirt.h | 33 ++++++++++++++++++++++++++---
> 1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
> index e08513d73119..a980756f58df 100644
> --- a/arch/powerpc/include/asm/paravirt.h
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -121,9 +121,19 @@ static inline bool vcpu_is_preempted(int cpu)
> if (!is_shared_processor())
> return false;
>
> + if (!(yield_count_of(cpu) & 1))
> + return false;
> +
> + /*
> + * If CPU has yielded but OS has not requested idle then this CPU is
> + * definitely preempted.
> + */
> + if (!lppaca_of(cpu).idle)
> + return true;
> +
> #ifdef CONFIG_PPC_SPLPAR
> if (!is_kvm_guest()) {
> - int first_cpu;
> + int first_cpu, i;
>
> /*
> * The result of vcpu_is_preempted() is used in a
> @@ -149,11 +159,28 @@ static inline bool vcpu_is_preempted(int cpu)
> */
> if (cpu_first_thread_sibling(cpu) == first_cpu)
> return false;
> +
> + /*
> + * If any of the threads of this core is not preempted or
> + * ceded, then consider this CPU to be non-preempted
> + */
> + first_cpu = cpu_first_thread_sibling(cpu);
> + for (i = first_cpu; i < first_cpu + threads_per_core; i++) {
> + if (i == cpu)
> + continue;
> + if (!(yield_count_of(i) & 1))
> + return false;
> + if (!lppaca_of(i).idle)
> + return true;
> + }
> }
> #endif
>
> - if (yield_count_of(cpu) & 1)
> - return true;
> + /*
> + * None of the threads in this thread group are running but none of
> + * them were preempted too. Hence assume the thread to be
> + * non-preempted.
> + */
> return false;
> }
>