Re: [PATCH] locking/osq: Drop the overload of osq lock

From: Boqun Feng
Date: Sun Jun 26 2016 - 02:55:34 EST


On Sun, Jun 26, 2016 at 02:10:57PM +0800, Boqun Feng wrote:
> On Sun, Jun 26, 2016 at 01:21:04PM +0800, panxinhui wrote:
> >
> > > On Jun 26, 2016, at 03:20, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, Jun 26, 2016 at 01:27:56AM +0800, panxinhui wrote:
> > >>>> Would that not have issues where the owner cpu is kept running but the
> > >>>> spinner (ie. _this_ vcpu) gets preempted? I would think that in that
> > >>>> case we too want to stop spinning.
> > >>>>
> > >>>
> > >> Do you mean that the spinner detects it has yielded out during the
> > >> big spin loop?
> > >>
> > >> That is very possible. But if the spinner (on this vcpu) yields
> > >> out, the next spinner will break out of the spin loop. And if the
> > >> spinner detects it has yielded out once, it is very likely to get
> > >> the osq lock soon, as long as the owner vcpu is running.
> > >>
> > >> So I think we only need to check the owner vcpu's yield_count.
> > >
> > > I had a quick look at KVM and it looks like it only has
> > > kvm_cpu::preempted, which would suggest the interface boqun proposed.
> > >
> > > We'll have to look at many of the other virt platforms as well to see
> > > what they can do.
> > >
> > > We could also provide _both_ interfaces and a platform can implement
> > > whichever variant (or both) it can.
> > >
> > The kvm code on ppc has already implemented yield_count, which makes
> > me feel a little relieved. :)
> >
> > It looks like we could introduce an interface like the one below.
> >
> > bool vcpu_is_preempted(int cpu)
> > {
> > 	return arch_vcpu_is_preempted(cpu);
> > }
> >
> > #ifdef arch_vcpu_has_yield_count
> > bool vcpu_has_preempted_once(int cpu, unsigned int yield_count)
> > {
> > 	return arch_get_vcpu_yield_count(cpu) != yield_count;
> > }
> >
> > #else
> > bool vcpu_has_preempted_once(int cpu, unsigned int yield_count)
> > {
> > 	/* just let the caller know it is preempted */
> > 	return vcpu_is_preempted(cpu);
> > }
> > #endif
> >
>
> Unfortunately, on PPC, we compile the pseries code along with the
> powernv code into a single kernel binary, so we need to wire up the
> proper primitives at runtime.
>
> I've cooked up the following patch. Please note that I haven't tested
> it; this is just a POC.
>
> Regards,
> Boqun
>
> ---------------->8
> virt: Introduce vcpu preemption detection functions
>
> Lock holder preemption is a serious problem for locking in a
> virtualized environment. Different architectures and hypervisors
> employ different ways to try to solve the problem for different
> locking mechanisms. Recently, Pan Xinhui found a significant
> performance drop in osq_lock(), which could be solved by providing a
> mechanism to detect vcpu preemption.
>
> Therefore this patch introduces several vcpu preemption detection
> primitives, which allow locking primitives to avoid unnecessary
> spinning. These primitives act as an abstraction layer between
> locking functions and arch- or hypervisor-specific code.
>
> There are two sets of primitives:
>
> 1. vcpu_preempt_count() and vcpu_has_preempted(): these must be used
> as a pair within the same preempt-disable critical section, and they
> detect whether a vcpu preemption happened between the two calls.
>
> 2. vcpu_is_preempted(): checks whether another cpu's vcpu is
> currently preempted.
>
> This patch also implements these primitives on pseries and wires them up.
>
> Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
> ---
> arch/powerpc/platforms/pseries/Kconfig | 1 +
> arch/powerpc/platforms/pseries/setup.c | 28 +++++++++++
> include/linux/vcpu_preempt.h | 90 ++++++++++++++++++++++++++++++++++
> 3 files changed, 119 insertions(+)
> create mode 100644 include/linux/vcpu_preempt.h
>
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index bec90fb30425..7d24c3e48878 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -21,6 +21,7 @@ config PPC_PSERIES
> select HOTPLUG_CPU if SMP
> select ARCH_RANDOM
> select PPC_DOORBELL
> + select HAS_VCPU_PREEMPTION_DETECTION
> default y
>
> config PPC_SPLPAR
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 9883bc7ea007..5d4aed54e039 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -42,6 +42,7 @@
> #include <linux/of.h>
> #include <linux/of_pci.h>
> #include <linux/kexec.h>
> +#include <linux/vcpu_preempt.h>
>
> #include <asm/mmu.h>
> #include <asm/processor.h>
> @@ -501,6 +502,31 @@ static void __init find_and_init_phbs(void)
> of_pci_check_probe_only();
> }
>
> +struct vcpu_preempt_ops vcpu_preempt_ops = DEFAULT_VCPU_PREEMPT_OPS;
> +EXPORT_SYMBOL(vcpu_preempt_ops);
> +
> +static long pseries_vcpu_preempt_count(void)
> +{
> + return be32_to_cpu(get_lppaca()->yield_count);
> +}
> +
> +static bool pseries_vcpu_is_preempted(int cpu)
> +{
> + return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
> +}
> +
> +static bool pseries_vcpu_has_preempted(long vpc)
> +{
> + return pseries_vcpu_preempt_count() != vpc;
> +}
> +
> +static void __init pseries_setup_vcpu_preempt_ops(void)
> +{
> + vcpu_preempt_ops.preempt_count = pseries_vcpu_preempt_count;
> + vcpu_preempt_ops.is_preempted = pseries_vcpu_is_preempted;
> + vcpu_preempt_ops.has_preempted = pseries_vcpu_has_preempted;
> +}
> +
> static void __init pSeries_setup_arch(void)
> {
> set_arch_panic_timeout(10, ARCH_PANIC_TIMEOUT);
> @@ -549,6 +575,8 @@ static void __init pSeries_setup_arch(void)
> "%ld\n", rc);
> }
> }
> +
> + pseries_setup_vcpu_preempt_ops();
> }
>
> static int __init pSeries_init_panel(void)
> diff --git a/include/linux/vcpu_preempt.h b/include/linux/vcpu_preempt.h
> new file mode 100644
> index 000000000000..4a88414bb83f
> --- /dev/null
> +++ b/include/linux/vcpu_preempt.h
> @@ -0,0 +1,90 @@
> +/*
> + * Primitives for checking the vcpu preemption from the guest.
> + */
> +
> +static long __vcpu_preempt_count(void)
> +{
> + return 0;
> +}
> +
> +static bool __vcpu_has_preempted(long vpc)
> +{
> + return false;
> +}
> +
> +static bool __vcpu_is_preempted(int cpu)
> +{
> + return false;
> +}
> +
> +struct vcpu_preempt_ops {
> + /*
> + * Get the current vcpu's "preempt count", which is going to be used
> + * for checking whether the current vcpu has ever been preempted.
> + */
> + long (*preempt_count)(void);
> +
> + /*
> + * Return whether a vcpu is preempted
> + */
> + bool (*is_preempted)(int cpu);
> +
> + /*
> + * Given a "vcpu preempt count", return whether a vcpu preemption
> + * happened after .preempt_count() was called.
> + */
> + bool (*has_preempted)(long vpc);
> +};
> +
> +struct vcpu_preempt_ops vcpu_preempt_ops;
> +

This should be:

extern struct vcpu_preempt_ops vcpu_preempt_ops;
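
The quoted hunk stops right after that declaration; presumably the
remainder of the header just dispatches through the ops structure. The
DEFAULT_VCPU_PREEMPT_OPS initializer and the wrappers below are an
illustrative sketch of that shape, not the exact hunk:

#define DEFAULT_VCPU_PREEMPT_OPS {			\
	.preempt_count	= __vcpu_preempt_count,		\
	.is_preempted	= __vcpu_is_preempted,		\
	.has_preempted	= __vcpu_has_preempted,		\
}

extern struct vcpu_preempt_ops vcpu_preempt_ops;

/* Snapshot the current vcpu's "preempt count" (yield_count on pseries). */
static inline long vcpu_preempt_count(void)
{
	return vcpu_preempt_ops.preempt_count();
}

/* Check whether @cpu's vcpu is currently preempted by the hypervisor. */
static inline bool vcpu_is_preempted(int cpu)
{
	return vcpu_preempt_ops.is_preempted(cpu);
}

/* Check whether a vcpu preemption happened since @vpc was snapshotted. */
static inline bool vcpu_has_preempted(long vpc)
{
	return vcpu_preempt_ops.has_preempted(vpc);
}

With the default ops pointing at the __vcpu_*() stubs, all three checks
simply report "never preempted" until an arch (pseries here) overrides
them at boot.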

And I tested this one along with a modified version of Xinhui's patch.

The test showed that even on a system that is not over-committed, the
overhead of mutex_lock() went down from 2.58% (w/o patches) to 0.87%
(w/ patches).

Xinhui, I also modified the title and commit log of your patch; feel
free to take anything from it into your own patch.
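
One more note on the pairwise contract described in the commit log
above: vcpu_preempt_count() and vcpu_has_preempted() are meant to be
used within one preempt-disable section, snapshot first, check later.
A minimal illustrative sketch (wait_for_flag() is a made-up stand-in
for whatever condition is being spun on, not anything in these patches):

static void spin_until(bool (*wait_for_flag)(void))
{
	long vpc;

	preempt_disable();
	vpc = vcpu_preempt_count();	/* snapshot before spinning */

	while (!wait_for_flag()) {
		/* We were scheduled out since the snapshot; stop wasting cycles. */
		if (vcpu_has_preempted(vpc))
			break;
		cpu_relax();
	}
	preempt_enable();
}

The osq_lock() change below follows this pattern, plus an additional
vcpu_is_preempted() check on the cpu it is waiting on.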

Regards,
Boqun

----------------->8
locking/osq: Stop spinning in osq_lock if related vcpus get preempted

In a virtualized environment, a cpu may hold the osq lock while its vcpu
gets preempted, which makes the spinning of the other cpus in the osq
pointless: it may be quite a while before the holder gets to run again
and notifies its successor to go.

To fix this, use the vcpu preemption detection mechanism to check
whether the lock holder has been preempted; if so, stop spinning in
osq_lock(). Moreover, if the spinner itself has ever been preempted,
the spinning in osq_lock() is also pointless, so detect that case and
stop spinning as well.

Signed-off-by: Pan Xinhui <xinhui.pan@xxxxxxxxxxxxxxxxxx>
---
kernel/locking/osq_lock.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..bed5a56a6aed 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -1,6 +1,7 @@
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/osq_lock.h>
+#include <linux/vcpu_preempt.h>

/*
* An MCS like lock especially tailored for optimistic spinning for sleeping
@@ -87,6 +88,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
int old;
+ int loops;
+ long vpc;

node->locked = 0;
node->next = NULL;
@@ -106,6 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
node->prev = prev;
WRITE_ONCE(prev->next, node);

+ old = old - 1;
+ vpc = vcpu_preempt_count();
+
/*
* Normally @prev is untouchable after the above store; because at that
* moment unlock can proceed and wipe the node element from stack.
@@ -118,8 +124,14 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
* If we need to reschedule bail... so we can block.
+ * An over-committed guest with more vCPUs than pCPUs
+ * might fall into this loop and cause a huge overload:
+ * vCPU A (prev) holds the osq lock and yields out, while
+ * vCPU B (node) waits for ->locked to be set, IOW, waits
+ * until vCPU A runs again and unlocks the osq lock.
+ * NOTE that vCPU A and vCPU B might run on the same physical cpu.
*/
- if (need_resched())
+ if (need_resched() || vcpu_is_preempted(old) || vcpu_has_preempted(vpc))
goto unqueue;

cpu_relax_lowlatency();