Re: [PATCH] rcu: Benefit from expedited grace period in __wait_rcu_gp

From: Paul E. McKenney
Date: Wed Oct 24 2018 - 07:20:14 EST


On Tue, Oct 23, 2018 at 03:13:43PM +0000, Raslan, KarimAllah wrote:
> On Fri, 2018-10-19 at 13:21 -0700, Paul E. McKenney wrote:
> > On Fri, Oct 19, 2018 at 07:45:51PM +0000, Raslan, KarimAllah wrote:
> > >
> > > On Fri, 2018-10-19 at 05:31 -0700, Paul E. McKenney wrote:
> > > >
> > > > On Fri, Oct 19, 2018 at 02:49:05AM +0200, KarimAllah Ahmed wrote:
> > > > >
> > > > >
> > > > > When expedited grace-period is set, both synchronize_sched
> > > > > synchronize_rcu_bh can be optimized to have a significantly lower latency.
> > > > >
> > > > > Improve wait_rcu_gp handling to also account for expedited grace-period.
> > > > > The downside is that wait_rcu_gp will not wait anymore for all RCU variants
> > > > > concurrently when an expedited grace-period is set, however, given the
> > > > > improved latency it does not really matter.
> > > > >
> > > > > Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > > > > Cc: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> > > > > Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> > > > > Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> > > > > Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
> > > > > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > > > > Signed-off-by: KarimAllah Ahmed <karahmed@xxxxxxxxx>
> > > >
> > > > Cute!
> > > >
> > > > Unfortunately, there are a few problems with this patch:
> > > >
> > > > 1. I will be eliminating synchronize_rcu_mult() due to the fact that
> > > > the upcoming RCU flavor consolidation eliminates its sole caller.
> > > > See 5fc9d4e000b1 ("rcu: Eliminate synchronize_rcu_mult()")
> > > > in my -rcu tree. This would of course also eliminate the effects
> > > > of this patch.
> > >
> > > Your patch covers our use-case already, but I still think that the semantics 
> > > for wait_rcu_gp is not clear to me.
> > >
> > > The problem for us was that sched_cpu_deactivate would call
> > > synchronize_rcu_mult which does not check for "expedited" at all. So even
> > > though we are already using rcu_expedited sysctl variable, synchronize_rcu_mult 
> > > was just ignoring it.
> > >
> > > That being said, I indeed overlooked rcu_normal and that it takes precedence 
> > > over expedited and I did not notice at all the deadlock you mentioned below!
> > >
> > > That can however be easily fixed by also checking for !rcu_gp_is_normal.
> >
> > ???
> >
> > The aforementioned 5fc9d4e000b1 commit replaces the synchronize_rcu_mult()
> > with synchronize_rcu(), which really would be subject to the sysfs
> > variables. Of course, this is not yet in mainline, so it perhaps cannot
> > solve your immediate problem, which probably involve older kernels in
> > any case. More on this below...
> >
> > >
> > > >
> > > > 2. The real-time guys' users are not going to be at all happy
> > > > with the IPIs resulting from the _expedited() API members.
> > > > Yes, they can boot with rcupdate.rcu_normal=1, but they don't
> > > > always need that big a hammer, and use of this kernel parameter
> > > > can slow down boot, hibernation, suspend, network configuration,
> > > > and much else besides. We therefore don't want them to have to
> > > > use rcupdate.rcu_normal=1 unless absolutely necessary.
> > >
> > > I might be missing something here. Why would they need to "explicitly" use 
> > > rcu_normal? If rcu_expedited is set, would not the expected behavior is to call 
> > > into the expedited version?
> > >
> > > My patch should only activate *expedited* if and only if it is set.
> >
> > You are right, I was confused. However...
> >
> > >
> > > I think I might be misunderstanding the expected behavior 
> > > from synchronize_rcu_mult. My understanding is that something like:
> > >
> > > synchronize_rcu_mult(call_rcu_sched) and synchronize_rcu() should have an 
> > > identical behavior, right?
> >
> > You would clearly prefer that it did, and the commit log does seem to
> > read that way, but synchronize_rcu_mult() is going away anyway, so there
> > isn't a whole lot of point in arguing about what it should have done.
> > And the eventual implementation (with 5fc9d4e000b1 or its successor)
> > will act as you want.
> >
> > >
> > > At least in this commit:
> > >
> > > commit d7d34d5e46140 ("sched: Rely on synchronize_rcu_mult() de-duplication")
> > >
> > > .. the change clearly gives the impression that they can be used 
> > > interchangeably. The problem is that this is not true when you look at the 
> > > implementation. One of them (i.e. synchronize_rcu) will respect the
> > > expedite_rcu flag set by sysfs while the other (i.e. synchronize_rcu_mult) 
> > > simply ignores it.
> > >
> > > So my patch is about making sure that both of the variants actually respect 
> > > it.
> >
> > I am guessing that you need to make an older version of the kernel
> > expedite the CPU-hotplug grace periods. I am also guessing that you can
> > carry patches to your kernels. If so, I suggest the following simpler
> > change to sched_cpu_deactivate() in kernel/sched/core.c:
> >
> > if (rcu_gp_is_expedited()) {
> > synchronize_sched_expedited();
> > if (IS_ENABLED(CONFIG_PREEMPT))
> > synchronize_rcu_expedited();
> > } else {
> > synchronize_rcu_mult(call_rcu, call_rcu_sched);
> > }
> >
> > As soon as this patch conflicts due to the synchronize_rcu_mult()
> > becoming synchronize_rcu(), you can drop the patch. And this is the only
> > use of synchronize_rcu_mult(), so this approach loses no generality.
> > Longer term, this patch might possibly be the backport of 5fc9d4e000b1
> > back to v4.14, but at the end of the day this is up to the various
> > -stable maintainers.
> >
> > Hmmm... If you are running CONFIG_PREEMPT=n, you can use an even
> > simpler replacement for synchronize_rcu_mult():
> >
> > synchronize_sched_expedited(); /* Bug if CONFIG_PREEMPT=y!!! */
> >
> > Would either of those two approaches work for you, or am I still missing
> > something?
>
> Sorry, I was not clear. The original commit in your -rcu tree already works for 
> us. So that is sorted out, thank you :)

Very good, thank you!

Just to confirm:

1. By "original commit", you mean 5fc9d4e000b1 ("rcu: Eliminate
synchronize_rcu_mult()"), right?

2. Please note that 5fc9d4e000b1 can only be backported to the
upcoming v4.20 release, and given that it is only an uncommon-case
performance regression, I would not submit it to -stable at all.
So if you need this performance regression to work in v4.19 or
earlier, please:

a. Let me know.

b. Test this replacement for the current call to
synchronize_rcu_mult() from sched_cpu_deactivate()
in kernel/sched/core.c:

if (rcu_gp_is_expedited()) {
synchronize_sched_expedited();
if (IS_ENABLED(CONFIG_PREEMPT))
synchronize_rcu_expedited();
} else {
synchronize_rcu_mult(call_rcu, call_rcu_sched);
}

> In my previous reply, when I was referring to synchronize_rcu_mult I really
> also wanted to also point to __wait_rcu_gp. With the current state in your -rcu 
> tree, __wait_rcu_gp does not look whether "expedited" is enabled or not. This
> is currently delegated to the callers (e.g. synchronize_rcu and 
> synchronize_rcu_bh). So any new direct users of __wait_rcu_gp would also miss 
> this check (i.e. just like what happened in synchronize_rcu_mult). If we want 
> __wait_rcu_gp to always respect rcu_expedited and rcu_normal flags, we might 
> want to pull these checks into __wait_rcu_gp instead and remove them from the 
> callers.

Yes, __wait_rcu_gp() does arguably have odd semantics. It is after all
used to implement the rcu_barrier() call in Tiny RCU and waits for a
normal (not expedited!) grace period in Tree RCU. But there are many
other places in RCU where the semantics are quite a bit more unusual.

So welcome to my world! ;-)

Thanx, Paul

> > > > 3. If the real-time guys' users were to have booted with
> > > > rcupdate.rcu_normal=1, then synchronize_sched_expedited()
> > > > would invoke _synchronize_rcu_expedited, which would invoke
> > > > wait_rcu_gp(), which would invoke _wait_rcu_gp() which would
> > > > invoke __wait_rcu_gp(), which, given your patch, would in turn
> > > > invoke synchronize_sched_expedited(). This situation could
> > > > well prevent their systems from meeting their response-time
> > > > requirements.
> > > >
> > > > So I cannot accept this patch nor for that matter any similar patch.
> > > >
> > > > But what were you really trying to get done here? If you were thinking
> > > > of adding another synchronize_rcu_mult(), the flavor consolidation will
> > > > make that unnecessary in most cases. If you are trying to speed up
> > > > CPU-hotplug operations, I suggest using the rcu_expedited sysctl variable
> > > > when taking a CPU offline. If something else, please let me know what
> > > > it is so that we can work out how the problem might best be solved.
> > > >
> > > > Thanx, Paul
> > > >
> > > > >
> > > > >
> > > > > ---
> > > > > kernel/rcu/update.c | 34 ++++++++++++++++++++++++++++------
> > > > > 1 file changed, 28 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > > > index 68fa19a..44b8817 100644
> > > > > --- a/kernel/rcu/update.c
> > > > > +++ b/kernel/rcu/update.c
> > > > > @@ -392,13 +392,27 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
> > > > > might_sleep();
> > > > > continue;
> > > > > }
> > > > > - init_rcu_head_on_stack(&rs_array[i].head);
> > > > > - init_completion(&rs_array[i].completion);
> > > > > +
> > > > > for (j = 0; j < i; j++)
> > > > > if (crcu_array[j] == crcu_array[i])
> > > > > break;
> > > > > - if (j == i)
> > > > > - (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
> > > > > + if (j != i)
> > > > > + continue;
> > > > > +
> > > > > + if ((crcu_array[i] == call_rcu_sched ||
> > > > > + crcu_array[i] == call_rcu_bh)
> > > > > + && rcu_gp_is_expedited()) {
> > > > > + if (crcu_array[i] == call_rcu_sched)
> > > > > + synchronize_sched_expedited();
> > > > > + else
> > > > > + synchronize_rcu_bh_expedited();
> > > > > +
> > > > > + continue;
> > > > > + }
> > > > > +
> > > > > + init_rcu_head_on_stack(&rs_array[i].head);
> > > > > + init_completion(&rs_array[i].completion);
> > > > > + (crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
> > > > > }
> > > > >
> > > > > /* Wait for all callbacks to be invoked. */
> > > > > @@ -407,11 +421,19 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
> > > > > (crcu_array[i] == call_rcu ||
> > > > > crcu_array[i] == call_rcu_bh))
> > > > > continue;
> > > > > +
> > > > > + if ((crcu_array[i] == call_rcu_sched ||
> > > > > + crcu_array[i] == call_rcu_bh)
> > > > > + && rcu_gp_is_expedited())
> > > > > + continue;
> > > > > +
> > > > > for (j = 0; j < i; j++)
> > > > > if (crcu_array[j] == crcu_array[i])
> > > > > break;
> > > > > - if (j == i)
> > > > > - wait_for_completion(&rs_array[i].completion);
> > > > > + if (j != i)
> > > > > + continue;
> > > > > +
> > > > > + wait_for_completion(&rs_array[i].completion);
> > > > > destroy_rcu_head_on_stack(&rs_array[i].head);
> > > > > }
> > > > > }
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > Amazon Development Center Germany GmbH
> > > Berlin - Dresden - Aachen
> > > main office: Krausenstr. 38, 10117 Berlin
> > > Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> > > Ust-ID: DE289237879
> > > Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
> >
> >
> Amazon Development Center Germany GmbH
> Berlin - Dresden - Aachen
> main office: Krausenstr. 38, 10117 Berlin
> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
> Ust-ID: DE289237879
> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B