Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) causes slow qemu boot

From: zhangfei.gao@xxxxxxxxxxx
Date: Fri Jun 24 2022 - 11:32:19 EST




On 2022/6/20 3:50 PM, Neeraj Upadhyay wrote:
Hi,


On 6/18/2022 8:37 AM, zhangfei.gao@xxxxxxxxxxx wrote:


On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote:
Hi,

On 6/15/2022 2:33 PM, zhangfei.gao@xxxxxxxxxxx wrote:


On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:

On 2022/6/14 下午8:19, Neeraj Upadhyay wrote:
5.18-rc4 based               ~8sec
5.19-rc1                     ~2m43sec
5.19-rc1+fix1                ~19sec
5.19-rc1+fix2                ~19sec

If you try the diff below on top of either 5.19-rc1+fix1 or 5.19-rc1+fix2, does it show any difference in boot time? (The change makes srcu_gp_end() default to a one-jiffy callback delay and drops the delay only when an expedited grace period has been requested, rather than consulting srcu_get_delay().)

--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
  */
 static void srcu_gp_end(struct srcu_struct *ssp)
 {
-       unsigned long cbdelay;
+       unsigned long cbdelay = 1;
        bool cbs;
        bool last_lvl;
        int cpu;
@@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
        spin_lock_irq_rcu_node(ssp);
        idx = rcu_seq_state(ssp->srcu_gp_seq);
        WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
-       cbdelay = !!srcu_get_delay(ssp);
+       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
+               cbdelay = 0;
+
        WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
Thank you both for the testing and the proposed fix!

Tested here:
qemu: https://github.com/qemu/qemu/tree/stable-6.1
kernel:
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
(in case the test patches are unclear, they are pushed to that git tree)

Hardware: aarch64

1. 5.18-rc6
real    0m8.402s
user    0m3.015s
sys     0m1.102s

2. 5.19-rc1
real    2m41.433s
user    0m3.097s
sys     0m1.177s

3. 5.19-rc1 + fix1 from Paul
real    2m43.404s
user    0m2.880s
sys     0m1.214s

4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
real    0m15.262s
user    0m3.003s
sys     0m1.033s

When building the kernel at the same time, the load time becomes longer.

5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
real    0m15.215s
user    0m2.942s
sys    0m1.172s

6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
real    1m23.936s
user    0m2.969s
sys    0m1.181s
And thank you for the testing!

Could you please try fix3 + Neeraj's change of srcu_gp_end?

That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
srcu_gp_end.

Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
times start rising?  This is probably best done by starting with
SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
until boot time becomes slow, followed by a binary search between the
last two values.  (The idea is to bias the search so that fast boot
times are the common case.)
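A small sketch of that search, with boot_is_slow() as a hypothetical stand-in for "rebuild with that SRCU_MAX_NODELAY_PHASE value, boot, and time it" (the threshold used here is a placeholder, not measured data):

#include <stdbool.h>
#include <stdio.h>

static bool boot_is_slow(long phase)
{
	return phase < 1000;	/* placeholder threshold, for illustration only */
}

int main(void)
{
	long fast = 100000;	/* starting value assumed to boot fast */
	long slow;

	/* divide by ten on each run until boot time becomes slow */
	while (!boot_is_slow(fast / 10))
		fast /= 10;
	slow = fast / 10;

	/* then binary search between the last two values */
	while (fast - slow > 1) {
		long mid = slow + (fast - slow) / 2;

		if (boot_is_slow(mid))
			slow = mid;
		else
			fast = mid;
	}
	printf("boot time starts rising somewhere below SRCU_MAX_NODELAY_PHASE=%ld\n", fast);
	return 0;
}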

With SRCU_MAX_NODELAY_PHASE 100, boot becomes slower.


8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
real    0m11.154s ~12s
user    0m2.919s
sys     0m1.064s

9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
real    0m11.258s
user    0m3.113s
sys     0m1.073s

10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
real    0m30.053s ~ 32s
user    0m2.827s
sys     0m1.161s

By the way, if the kernel is also being built on the board at the same time (consuming memory), the time becomes much longer:
real    1m2.763s

11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
real    0m11.443s
user    0m3.022s
sys     0m1.052s



This is useful data, thanks! Did you get a chance to check between 100 and 1000, to narrow down further at which point (it does not need to be an exact value) you start seeing degradation, e.g. 250, 500, ...?

Is it also possible to try experiments 10 and 11 with the diff below?
What the diff does is call srcu_get_delay() only once in try_check_zero()
(rather than on every loop iteration), and retry with a different delay
for the extra iteration that is done when srcu_get_delay(ssp) returns 0.

Once we have this data, can you also try changing SRCU_RETRY_CHECK_LONG_DELAY to 100 on top of the diff below:

#define SRCU_RETRY_CHECK_LONG_DELAY  100

-------------------------------------------------------------------------
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 6a354368ac1d..3aff2f3e99ab 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
  * we repeatedly block for 1-millisecond time periods.
  */
 #define SRCU_RETRY_CHECK_DELAY         5
+#define SRCU_RETRY_CHECK_LONG_DELAY            5

 /*
  * Start an SRCU grace period.
@@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
  */
 static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
 {
+       unsigned long curdelay;
+       curdelay = !srcu_get_delay(ssp);
        for (;;) {
                if (srcu_readers_active_idx_check(ssp, idx))
                        return true;
-               if (--trycount + !srcu_get_delay(ssp) <= 0)
+               if (--trycount + curdelay <= 0)
                        return false;
-               udelay(SRCU_RETRY_CHECK_DELAY);
+               if (trycount)
+                       udelay(SRCU_RETRY_CHECK_DELAY);
+               else
+                       udelay(SRCU_RETRY_CHECK_LONG_DELAY);
        }
 }


11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
real    0m11.443s
user    0m3.022s
sys     0m1.052s

fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
real    0m11.401s
user    0m2.798s
sys     0m1.328s

fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
real    0m15.748s
user    0m2.781s
sys     0m1.294s

fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
real    0m20.704s ~ 21s
user    0m2.954s
sys     0m1.226s

fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
real    0m25.151s
user    0m2.980s
sys     0m1.256s

fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
real    0m30.053s ~ 32s
user    0m2.827s
sys     0m1.161s

fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
real    0m19.263s
user    0m3.018s
sys     0m1.211s

fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 100
real    0m9.347s
user    0m3.132s
sys     0m1.041s



Thanks. From the data and experiments done, it looks to me that we get comparable (to 5.18-rc4) timings when we retry without sleeping for a duration close to 4-5 ms, which could be close to the configured tick period (as HZ is 250)? Is it possible to try the configurations below on top of fix15?
If possible, can you try with both HZ_1000 and HZ_250?
As multiple fixes are getting combined in these experiments, for clarity, please also share the diff of srcutree.c (on top of the baseline) for all experiments.

16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1 jiffy)


#define SRCU_MAX_NODELAY_TRY_CHECK_PHASE       10
#define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
#define SRCU_RETRY_CHECK_LONG_DELAY           \
  (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)


17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 2 jiffy)

#define SRCU_RETRY_CHECK_LONG_DELAY        \
     (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)

18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1/2 jiffy)

#define SRCU_RETRY_CHECK_LONG_DELAY      \
       (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)
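
For reference, a minimal userspace sketch (not kernel code, and assuming HZ=250 and USEC_PER_SEC=1000000) of what the three SRCU_RETRY_CHECK_LONG_DELAY variants above work out to per udelay() retry:

#include <stdio.h>

#define HZ					250
#define USEC_PER_SEC				1000000
#define SRCU_MAX_NODELAY_TRY_CHECK_PHASE	10

int main(void)
{
	/* one jiffy at HZ=250 is 1000000 / 250 = 4000 us */
	printf("1 jiffy  : %d us per retry\n",
	       USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE);		/* 400 us */
	printf("2 jiffies: %d us per retry\n",
	       2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE);	/* 800 us */
	printf("1/2 jiffy: %d us per retry\n",
	       USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2);	/* 200 us */
	return 0;
}

So across the roughly 10 no-delay try_check_zero() calls, the long delays add up to about one jiffy (4 ms) for experiment 16, two jiffies for 17, and half a jiffy for 18; experiment 16 lands in the 4-5 ms range mentioned above.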

fix16: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1 jiffy)
real    0m10.120s
user    0m3.885s
sys     0m1.040s

fix17: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 2 jiffy)
real    0m9.851s
user    0m3.886s
sys     0m1.011s

fix18: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) + (long delay scaled to 1/2 jiffy)
real    0m9.741s
user    0m3.837s
sys     0m1.060s

Code pushed to:
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test