Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes qemu to boot slowly

From: Neeraj Upadhyay
Date: Tue Jun 14 2022 - 08:19:22 EST


Hi,


On 6/14/2022 2:25 AM, Shameerali Kolothum Thodi wrote:


-----Original Message-----
From: Paul E. McKenney [mailto:paulmck@xxxxxxxxxx]
Sent: 13 June 2022 15:59
To: zhangfei.gao@xxxxxxxxxxx
Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>; Zhangfei Gao
<zhangfei.gao@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
rcu@xxxxxxxxxxxxxxx; Lai Jiangshan <jiangshanlai@xxxxxxxxx>; Josh Triplett
<josh@xxxxxxxxxxxxxxxx>; Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx>; Matthew Wilcox <willy@xxxxxxxxxxxxx>;
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@xxxxxxxxxx>;
mtosatti@xxxxxxxxxx; Auger Eric <eric.auger@xxxxxxxxxx>
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes qemu to boot slowly

On Mon, Jun 13, 2022 at 09:23:50PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:


On 2022/6/13 8:18 PM, Paul E. McKenney wrote:
On Mon, Jun 13, 2022 at 02:55:47PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
Hi, Paul

On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
Hi, Paul

On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
On 6/12/22 18:40, Paul E. McKenney wrote:
Do these reserved memory regions really need to be allocated separately? (For example, are they really all non-contiguous? If not, that is, if there are a lot of contiguous memory regions, could you sort the IORT by address and do one ioctl() for each set of contiguous memory regions?)

Are all of these reserved memory regions set up before init is spawned?

Are all of these reserved memory regions set up while there is only a single vCPU up and running?

Is the SRCU grace period really needed in this case? (I freely confess to not being all that familiar with KVM.)
Oh, and there was a similar many-requests problem with networking many years ago. This was solved by adding a new syscall/ioctl()/whatever that permitted many requests to be presented to the kernel with a single system call.

Could a new ioctl() be introduced that requested a large number of these memory regions in one go so as to make each call to synchronize_rcu_expedited() cover a useful fraction of your 9000+ requests? Adding a few of the KVM guys on CC for their thoughts.
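Purely as an illustration of that idea (KVM has no such ioctl; every name in this sketch is hypothetical):

/* Hypothetical batch variant of KVM_SET_USER_MEMORY_REGION, so that one
 * system call -- and thus one expedited SRCU grace period -- could cover
 * many regions at once.  Not a real KVM interface. */
struct kvm_userspace_memory_region_batch {
        __u32 nregions;                         /* entries in regions[] */
        __u32 pad;
        struct kvm_userspace_memory_region regions[];
};

/* ret = ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION_BATCH, &batch); */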
Unfortunately not. Apart from this specific case, in general the calls to KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the guest, and those writes then map to an ioctl. Typically the guest sets up a device at a time, and each setup step causes a synchronize_srcu()---and expedited at that.

I was afraid of something like that...

KVM has two SRCUs:

1) kvm->irq_srcu hardly relies on the "sleepable" part; it has readers that are very, very small, but it needs extremely fast detection of grace periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-05-05), which split it off kvm->srcu. Readers are not so frequent.

2) kvm->srcu is nastier because there are readers all the time. The read-side critical sections are still short-ish, but they need the sleepable part because they access user memory.
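For reference, a minimal sketch of the read-side pattern both SRCUs rely on (the real readers are scattered through virt/kvm/ and arch code; this is illustrative only):

        int idx;

        idx = srcu_read_lock(&kvm->srcu);
        /* ... access memslots or user memory; may fault and sleep ... */
        srcu_read_unlock(&kvm->srcu, idx);

Each synchronize_srcu_expedited() on the write side must wait for all such readers that started earlier.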
Which one of these two is in play in this case?

Writers are not frequent per se; the problem is that they come in very large bursts when a guest boots. And while the whole boot path overall can be quadratic, O(n) expensive calls to synchronize_srcu() can have a larger impact on runtime than the O(n^2) parts, as demonstrated here.

Therefore, we operated on the assumption that the callers of synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest code, and that the desire was to get past the booting phase as fast as possible. If the guest wants to eat host CPU it can "for(;;)" as much as it wants; therefore, as long as expedited GPs didn't eat CPU *throughout the whole system*, a preemptable busy wait in synchronize_srcu_expedited() was not problematic.

These assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu were introduced (in 2009 and 2014, respectively). But perhaps they no longer hold now that each SRCU is not as independent as it was in those years, and instead uses workqueues?
The problem was not internal to SRCU, but rather due to the fact that kernel live patching (KLP) had problems with the CPU-bound tasks resulting from repeated synchronize_rcu_expedited() invocations. So I added heuristics to get the occasional sleep in there for KLP's benefit. Perhaps these heuristics need to be less aggressive about adding sleep.

These heuristics have these aspects:

1. The longer readers persist in an expedited SRCU grace period, the longer the wait between successive checks of the reader state. Roughly speaking, we wait as long as the grace period has currently been in effect, capped at ten jiffies. (For example, a grace period that has so far run for three jiffies waits another three jiffies before re-checking; one that has run for fifteen jiffies waits only the ten-jiffy cap.)

2. SRCU grace periods have several phases. We reset so that each phase starts by not waiting (new phase, new set of readers, so don't penalize this set for the sins of the previous set). But once we get to the point of adding delay, we add the delay based on the beginning of the full grace period.

Right now, the checking for grace-period length does not allow for the possibility that a grace period might start just before the jiffies counter gets incremented (because I didn't realize that anyone cared), so that is one possible thing to change. I can also allow more no-delay checks per SRCU grace-period phase.

Zhangfei, does something like the patch shown below help?

Additional adjustments are likely needed to avoid re-breaking KLP, but we have to start somewhere...

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019dea..6a354368ac1d1 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
 
 #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
 #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
 #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
 
 /*
@@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  */
 static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
+	unsigned long gpstart;
+	unsigned long j;
 	unsigned long jbase = SRCU_INTERVAL;
 
 	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
 		jbase = 0;
-	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
+	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+		j = jiffies - 1;
+		gpstart = READ_ONCE(ssp->srcu_gp_start);
+		if (time_after(j, gpstart))
+			jbase += j - gpstart;
+	}
 	if (!jbase) {
 		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
 		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
Unfortunately, this patch is not helpful.

So I re-added the debug info. During the qemu boot:
[  232.997667]  __synchronize_srcu loop=1000

[  361.094493]  __synchronize_srcu loop=9000
[  361.094501] Call trace:
[  361.094502]  dump_backtrace+0xe4/0xf0
[  361.094505]  show_stack+0x20/0x70
[  361.094507]  dump_stack_lvl+0x8c/0xb8
[  361.094509]  dump_stack+0x18/0x34
[  361.094511]  __synchronize_srcu+0x120/0x128
[  361.094514]  synchronize_srcu_expedited+0x2c/0x40
[  361.094515]  kvm_swap_active_memslots+0x130/0x198
[  361.094519]  kvm_activate_memslot+0x40/0x68
[  361.094520]  kvm_set_memslot+0x2f8/0x3b0
[  361.094523]  __kvm_set_memory_region+0x2e4/0x438
[  361.094524]  kvm_set_memory_region+0x78/0xb8
[  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
[  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
[  361.094530]  invoke_syscall+0x4c/0x110
[  361.094533]  el0_svc_common.constprop.0+0x68/0x128
[  361.094536]  do_el0_svc+0x34/0xc0
[  361.094538]  el0_svc+0x30/0x98
[  361.094541]  el0t_64_sync_handler+0xb8/0xc0
[  361.094544]  el0t_64_sync+0x18c/0x190
[  363.942817]  kvm_set_memory_region loop=6000
Huh.

One possibility is that the "if (!jbase)" block needs to be nested within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
I tested this diff, and it is not helpful either:

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019de..36286a4b74e6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
 
 #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
 #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
-#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
 #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
 
 /*
@@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  */
 static unsigned long srcu_get_delay(struct srcu_struct *ssp)
 {
+	unsigned long gpstart;
+	unsigned long j;
 	unsigned long jbase = SRCU_INTERVAL;
 
 	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
 		jbase = 0;
-	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
-	if (!jbase) {
-		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
-		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
-			jbase = 1;
+	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+		j = jiffies - 1;
+		gpstart = READ_ONCE(ssp->srcu_gp_start);
+		if (time_after(j, gpstart))
+			jbase += j - gpstart;
+
+		if (!jbase) {
+			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
+				jbase = 1;
+		}
 	}
That is in fact what I was intending you to test, thank you. As you say, unfortunately it did not help.

Could you please test removing the "if (!jbase)" block entirely?
Remove "if (!jbase)" block is much faster,
not measure clearly, qemu (with debug version efi) boot seems normally.

From log timestamp:
[  114.624713] __synchronize_srcu loop=1000
[  124.157011]  __synchronize_srcu loop=9000

Comparing the methods, the timestamps differ:

5.19-rc1:
[   94.271350]  __synchronize_srcu loop=1001
[  222.621659]  __synchronize_srcu loop=9001

With your first diff:
[  232.997667]  __synchronize_srcu loop=1000
[  361.094493]  __synchronize_srcu loop=9000

With the "if (!jbase)" block removed:
[  114.624713]  __synchronize_srcu loop=1000
[  124.157011]  __synchronize_srcu loop=9000

5.18 method (srcu_get_delay() reduced to):
+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		return 0;
+	return SRCU_INTERVAL;

[   68.938297]  __synchronize_srcu loop=1000
[   74.598480]  __synchronize_srcu loop=9000
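Back-of-envelope from these timestamps (8000 loops between the two prints): 5.19-rc1 and the first diff both take about 128 s, roughly 16 ms per synchronize_srcu_expedited() call, or about 4 jiffies at CONFIG_HZ_250; with the "if (!jbase)" block removed it is about 9.5 s (~1.2 ms per call); and the 5.18 method is about 5.7 s (~0.7 ms per call).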

Thank you for the information!

What happens if you keep that "if (!jbase)" block, but set the value of the SRCU_MAX_NODELAY_PHASE macro very large, say 1000000?

From the setup I have, this behaves almost the same as the previous logic (without the "if (!jbase)" block). In both cases, I think we are not close to 5.18, but definitely much better than 5.19-rc1.

The numbers from my test setup (CONFIG_HZ_250, CONFIG_PREEMPTION=y),

Guest boot time (using 'time'):

5.18-rc4 based: ~8 sec
5.19-rc1:       ~2 min 43 sec
5.19-rc1+fix1:  ~19 sec
5.19-rc1+fix2:  ~19 sec

If you try the below diff on top of either 5.19-rc1+fix1 or 5.19-rc1+fix2, does it show any difference in boot time? (The diff makes the post-grace-period callback delay depend only on whether another expedited GP has been requested, instead of calling srcu_get_delay().)

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
  */
 static void srcu_gp_end(struct srcu_struct *ssp)
 {
-	unsigned long cbdelay;
+	unsigned long cbdelay = 1;
 	bool cbs;
 	bool last_lvl;
 	int cpu;
@@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
 	spin_lock_irq_rcu_node(ssp);
 	idx = rcu_seq_state(ssp->srcu_gp_seq);
 	WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
-	cbdelay = !!srcu_get_delay(ssp);
+	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+		cbdelay = 0;
+
 	WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());


Thanks
Neeraj


I will wait for Zhangfei to confirm this on his setup, especially the difference compared to 5.18.

Thanks,
Shameer
This would be too large for KLP, but my hope is that there is a value
of SRCU_MAX_NODELAY_PHASE that works for everyone. But first, does
this help at all? ;-)

And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the above change reduces the overhead by more than an order of magnitude. Except that the overhead of that series is far less than one second for me, not the several minutes that you are seeing. So the per-call overhead decreases from about 17 microseconds to a bit more than one microsecond.

I could imagine an extra order of magnitude if you are running HZ=100 instead of the HZ=1000 that I am running. But that only gets up to a few seconds.
One possible reason for the difference would be if your code has SRCU readers.

Could you please tell me the value of CONFIG_HZ on your system? Also the value of CONFIG_PREEMPTION?
I am using arch/arm64/configs/defconfig
make defconfig
CONFIG_PREEMPTION=y
CONFIG_HZ_250=y

Thank you again!

And if there is a good value of SRCU_MAX_NODELAY_PHASE, it might depend on HZ. And who knows what all else...
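For a sense of scale (my arithmetic, not from the thread): with CONFIG_HZ_250 one jiffy is 4 ms, so even a single jiffy of delay per grace period adds roughly 36 seconds across 9000+ expedited grace periods, and the 10-jiffy SRCU_MAX_INTERVAL cap would allow several minutes.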

Thanx, Paul

Thanks


One additional debug is to apply the patch below on top of the one you just now kindly tested, then use whatever debug technique you wish to work out what fraction of the time during that critical interval srcu_get_delay() returns non-zero.

Apply the patch below? Sorry, I am confused, no patch, right?

Apologies, my omission.

Just measure when srcu_get_delay() returns non-zero?

Exactly, please!
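One way to measure that (a sketch only; the counters and the wrapper below are made up for illustration, living in kernel/rcu/srcutree.c where srcu_get_delay() is defined):

static atomic_t sgd_calls;
static atomic_t sgd_nonzero;

/* Debug wrapper: tally how often srcu_get_delay() asks for a delay. */
static unsigned long srcu_get_delay_dbg(struct srcu_struct *ssp)
{
	unsigned long d = srcu_get_delay(ssp);

	if (d)
		atomic_inc(&sgd_nonzero);
	if (!(atomic_inc_return(&sgd_calls) % 1000))
		pr_info("srcu_get_delay: %d/%d non-zero\n",
			atomic_read(&sgd_nonzero), atomic_read(&sgd_calls));
	return d;
}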

By the way, the issue should only be related to qemu ACPI, not to the RMR feature.
Test with: https://github.com/qemu/qemu/tree/stable-6.1

It looks like it is caused by too many kvm_region_add & kvm_region_del calls when acpi=force.
Without ACPI, kvm_region_add/del is not printed at all (the debug prints once per 1000 calls).

With acpi=force, during qemu boot:
kvm_region_add region_add = 1000
kvm_region_del region_del = 1000
kvm_region_add region_add = 2000
kvm_region_del region_del = 2000
kvm_region_add region_add = 3000
kvm_region_del region_del = 3000
kvm_region_add region_add = 4000
kvm_region_del region_del = 4000
kvm_region_add region_add = 5000
kvm_region_del region_del = 5000
kvm_region_add region_add = 6000
kvm_region_del region_del = 6000

kvm_region_add / kvm_region_del ->
  kvm_set_phys_mem ->
    kvm_set_user_memory_region ->
      kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
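The per-1000 prints above come from debug counters; the actual debug patch is not shown in the thread, but on the QEMU side (accel/kvm/kvm-all.c) it would be something like this sketch:

static int region_add_cnt;

static void kvm_region_add(MemoryListener *listener,
                           MemoryRegionSection *section)
{
    if (++region_add_cnt % 1000 == 0)
        printf("kvm_region_add region_add = %d\n", region_add_cnt);
    /* ... original body: registers the section via kvm_set_phys_mem() ... */
}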

[  361.094493]  __synchronize_srcu loop=9000
[  361.094501] Call trace:
[  361.094502]  dump_backtrace+0xe4/0xf0
[  361.094505]  show_stack+0x20/0x70
[  361.094507]  dump_stack_lvl+0x8c/0xb8
[  361.094509]  dump_stack+0x18/0x34
[  361.094511]  __synchronize_srcu+0x120/0x128
[  361.094514]  synchronize_srcu_expedited+0x2c/0x40
[  361.094515]  kvm_swap_active_memslots+0x130/0x198
[  361.094519]  kvm_activate_memslot+0x40/0x68
[  361.094520]  kvm_set_memslot+0x2f8/0x3b0
[  361.094523]  __kvm_set_memory_region+0x2e4/0x438
[  361.094524]  kvm_set_memory_region+0x78/0xb8
[  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
[  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
[  361.094530]  invoke_syscall+0x4c/0x110
[  361.094533]  el0_svc_common.constprop.0+0x68/0x128
[  361.094536]  do_el0_svc+0x34/0xc0
[  361.094538]  el0_svc+0x30/0x98
[  361.094541]  el0t_64_sync_handler+0xb8/0xc0
[  361.094544]  el0t_64_sync+0x18c/0x190
[  363.942817]  kvm_set_memory_region loop=6000
Good to know, thank you!

Thanx, Paul