Re: Severe performance regression w/ 4.4+ on Android due to cgroup locking changes

From: Paul E. McKenney
Date: Wed Jul 13 2016 - 19:04:01 EST


On Wed, Jul 13, 2016 at 03:39:37PM -0700, John Stultz wrote:
> On Wed, Jul 13, 2016 at 3:17 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Wed, Jul 13, 2016 at 02:46:37PM -0700, John Stultz wrote:
> >> On Wed, Jul 13, 2016 at 2:42 PM, Paul E. McKenney
> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >> > On Wed, Jul 13, 2016 at 02:18:41PM -0700, Paul E. McKenney wrote:
> >> >> On Wed, Jul 13, 2016 at 05:05:26PM -0400, Tejun Heo wrote:
> >> >> > On Wed, Jul 13, 2016 at 02:03:15PM -0700, Paul E. McKenney wrote:
> >> >> > > Take the patch that I just sent out and make the choice of normal
> >> >> > > vs. expedited depend on CONFIG_PREEMPT_RT or whatever the -rt guys are
> >> >> > > calling it these days. Is there a low-latency Kconfig option other
> >> >> > > than CONFIG_NO_HZ_FULL?
> >> >> >
> >> >> > Sounds like a plan to me.
> >> >>
> >> >> I like the way we like each other's idea. Mutually assured laziness? ;-)
> >> >
> >> > But here is what mine might look like. Untested, probably does
> >> > not even build. Note that the default is -no- expediting, use the
> >> > rcusync.expedited kernel parameter to enable it.
> >>
> >> I was working on something similar, but using a config option. Would
> >> adding a config option for the default make sense here, since I'd
> >> probably prefer to have one less thing to always specify on the
> >> cmdline?
> >
> > As long as you don't mind it depending on CONFIG_RCU_EXPERT, no problem.
> >
> > Perhaps like the following, on top of the previous patch?
> >
> > Or if you are going to put it in defconfig files only, I can make it
> > so that it isn't changeable at menuconfig time.
>
> I think having it discoverable via menuconfig is useful, and I've got
> no objections to it being under RCU_EXPERT
> (assuming I don't badly muck up my RCU settings accidentally :).

But isn't mucking up your RCU settings half of the fun? ;-)

> I only had that one nit about maybe wanting to put something in dmesg
> when we're using the expedited methods.

Updated, please see below.

> But otherwise both patches look great and are working well!
>
> Do you mind marking them both for stable 4.4+?

OK, looks like it does qualify in the "fix a notable performance or
interactivity issue" category.

> Tested-by: John Stultz <john.stultz@xxxxxxxxxx>
> Acked-by: John Stultz <john.stultz@xxxxxxxxxx>
>
> Also, do make sure Dmitry gets the reported-by credit for the first patch.

Done! The updated first patch is below, and the second will follow.

Thanx, Paul

------------------------------------------------------------------------

commit 59435eb836ee73b30ed6ada525125b67b4029321
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Date: Wed Jul 13 14:43:46 2016 -0700

rcu: Provide rcusync.expedited kernel boot parameter

Dmitry Shmidt and John Stultz noticed that __cgroup_procs_write()
sometimes incurred excessive overheads, ranging up into the tens of
milliseconds. Further testing confirmed speculation that this was due
to synchronize_sched() within rcusync being invoked by per-CPU rwsems.
This testing also showed that substituting synchronize_sched_expedited()
for synchronize_sched() greatly reduced the overheads to below 200
microseconds, with the occasional excursion into the low single digits
worth of milliseconds.

This commit therefore provides a rcusync.expedited kernel boot parameter
that causes rcusync to use expedited grace-period primitives.

Reported-by: Dmitry Shmidt <dimitrysh@xxxxxxxxxx>
Reported-by: John Stultz <john.stultz@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Tested-by: John Stultz <john.stultz@xxxxxxxxxx>
Acked-by: John Stultz <john.stultz@xxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> # 4.4.x-

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 82b42c958d1c..b8bc9854e548 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3229,6 +3229,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
energy efficiency by requiring that the kthreads
periodically wake up to do the polling.

+ rcusync.expedited [KNL]
+ Specify that the rcusync mechanism use expedited
+ grace periods. As of mid-2016, this affects
+ per-CPU rwsems.
+
rcutree.blimit= [KNL]
Set maximum number of finished RCU callbacks to
process in one batch.
diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index be922c9f3d37..0d0dc992cce7 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -22,6 +22,14 @@

#include <linux/rcu_sync.h>
#include <linux/sched.h>
+#include <linux/moduleparam.h>
+#include <linux/module.h>
+
+MODULE_ALIAS("rcusync");
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "rcusync."

#ifdef CONFIG_PROVE_RCU
#define __INIT_HELD(func) .held = func,
@@ -29,14 +37,14 @@
#define __INIT_HELD(func)
#endif

-static const struct {
+static struct {
void (*sync)(void);
void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
void (*wait)(void);
#ifdef CONFIG_PROVE_RCU
int (*held)(void);
#endif
-} gp_ops[] = {
+} gp_ops[] __read_mostly = {
[RCU_SYNC] = {
.sync = synchronize_rcu,
.call = call_rcu,
@@ -62,6 +70,21 @@ enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };

#define rss_lock gp_wait.lock

+static bool expedited;
+module_param(expedited, bool, 0444);
+
+static int __init rcu_sync_early_init(void)
+{
+ if (expedited) {
+ pr_info("RCU_SYNC: Expedited operation in effect.\n");
+ gp_ops[RCU_SYNC].sync = synchronize_rcu_expedited;
+ gp_ops[RCU_SCHED_SYNC].sync = synchronize_sched_expedited;
+ gp_ops[RCU_BH_SYNC].sync = synchronize_rcu_bh_expedited;
+ }
+ return 0;
+}
+early_initcall(rcu_sync_early_init);
+
#ifdef CONFIG_PROVE_RCU
void rcu_sync_lockdep_assert(struct rcu_sync *rsp)
{