Re: [sched,rcu] b84c4e08143: +3.1% will-it-scale.per_thread_ops

From: Paul E. McKenney
Date: Mon Apr 21 2014 - 21:51:07 EST

Next message: Andrew Lutomirski: "Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*"
Previous message: H. Peter Anvin: "Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*"
In reply to: Fengguang Wu: "Re: [sched,rcu] b84c4e08143: +3.1% will-it-scale.per_thread_ops"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Apr 19, 2014 at 04:11:46PM +0800, Fengguang Wu wrote:
> On Thu, Apr 17, 2014 at 06:55:03AM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 17, 2014 at 12:03:53PM +0800, Fengguang Wu wrote:
> > > Hi Paul,
> > >
> > > FYI, this improves will-it-scale/open1 throughput.
> >
> > Cool! Not a planned benefit, but I will take it. ;-)
> >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev.2014.04.14a
> > > commit b84c4e08143c98dad4b4d139f08db0b98b0d3ec4 ("sched,rcu: Make cond_resched() report RCU quiescent states")
> >
> > But how should I read the data below? I see lots of positive percentages
> > and lots of negative percentages for the delta, and all near zero for
> > standard deviation. Is the overall improvement an average of these or
> > some such? What is being measured?
>
> There are a lot of things being measured, which are shown after each
> "TOTAL". For example, to get an overview of the report:

OK, let me see if I understand this...

> grep "TOTAL" this_email
>
> 563496 ~ 0% +3.1% 581059 ~ 0% TOTAL will-it-scale.per_thread_ops
> 756894 ~ 0% +2.8% 778452 ~ 0% TOTAL will-it-scale.per_process_ops

These two look like improvements, albeit small ones.

> 0.57 ~ 0% -2.7% 0.55 ~ 0% TOTAL will-it-scale.scalability

This one looks like a degradation, again small.

> 346764 ~ 2% -74.0% 90164 ~ 1% TOTAL slabinfo.kmalloc-256.active_objs
> 10837 ~ 2% -73.9% 2824 ~ 1% TOTAL slabinfo.kmalloc-256.active_slabs
> 10837 ~ 2% -73.9% 2824 ~ 1% TOTAL slabinfo.kmalloc-256.num_slabs
> 346821 ~ 2% -73.9% 90393 ~ 1% TOTAL slabinfo.kmalloc-256.num_objs
> 105961 ~ 1% -63.0% 39153 ~ 1% TOTAL meminfo.SUnreclaim
> 26432 ~ 1% -62.9% 9814 ~ 1% TOTAL proc-vmstat.nr_slab_unreclaimable

The above all look great -- about the same amount of work done with a
lot less work on the part of the Linux kernel's memory-management system.

> 87318 ~ 0% +130.0% 200809 ~ 0% TOTAL softirqs.RCU

This one is not so good, as RCU is consuming more time an energy for
roughly the same amount of work.

> 140354 ~ 1% -47.6% 73490 ~ 0% TOTAL meminfo.Slab
> 77391 ~ 1% -46.7% 41235 ~ 2% TOTAL cpuidle.C6-NHM.usage
> 38368 ~ 2% -37.6% 23954 ~ 2% TOTAL softirqs.SCHED
> 1.24 ~ 4% -35.4% 0.80 ~ 3% TOTAL perf-profile.cpu-cycles.do_notify_resume.int_signal.close

These look like improvements, again less work for a given level of
throughput.

> 1.43 ~ 4% +41.9% 2.03 ~ 4% TOTAL perf-profile.cpu-cycles.rcu_process_callbacks.__do_softirq.irq_exit.smp_apic_timer_in

This is another aspect of RCU working harder. One good thing is that
there is less work per invocation. This could be consistent with shorter
RCU grace-period latencies, which would generally be a good thing.
(And yes, I still owe you RCU grace-period-latency measurements!)

> rupt.apic_timer_interrupt
> 1.27 ~ 3% -30.0% 0.89 ~ 6% TOTAL perf-profile.cpu-cycles.setup_object.isra.46.new_slab.__slab_alloc.kmem_cache_alloc.g

Less memory-allocation work, goodness.

> empty_filp
> 1.54 ~ 7% +35.6% 2.09 ~ 8% TOTAL perf-profile.cpu-cycles.kmem_cache_alloc.getname_flags.getname.do_sys_open.sys_open

But more of whatever this turns out to be. More allocations from within
the open() system call?

> 4.21 ~ 2% -29.1% 2.98 ~ 3% TOTAL perf-profile.cpu-cycles.link_path_walk.path_openat.do_filp_open.do_sys_open.sys_open
> 1.37 ~ 4% -23.1% 1.05 ~ 7% TOTAL perf-profile.cpu-cycles.__d_lookup_rcu.lookup_fast.link_path_walk.path_openat.do_filp
> en
> 0.88 ~17% +29.1% 1.14 ~ 9% TOTAL perf-profile.cpu-cycles.path_init.path_openat.do_filp_open.do_sys_open.sys_open

Mixed results on other aspects of the open() system call?

> 0.67 ~16% +33.6% 0.90 ~10% TOTAL perf-profile.cpu-cycles.restore_sigcontext.sys_rt_sigreturn.stub_rt_sigreturn.raise

I confess to being lost on this one.

> 3.19 ~ 1% +17.4% 3.74 ~ 5% TOTAL perf-profile.cpu-cycles.file_free_rcu.rcu_process_callbacks.__do_softirq.irq_exit.smp

More RCU softirqs means more softirq exit overhead, presumably.

> ic_timer_interrupt

???

> 4329 ~ 7% +15.2% 4986 ~ 5% TOTAL slabinfo.vm_area_struct.active_objs

But increased number of active objects would seem to hint at longer
RCU grace-period latencies, in contradiction with the previous hints
in the other direction.

> 2536 ~ 1% -75.8% 614 ~ 9% TOTAL time.involuntary_context_switches
> 32593 ~ 1% -62.1% 12349 ~ 2% TOTAL interrupts.0:IO-APIC-edge.timer

These look like improvements.

> 6934 ~ 9% +86.2% 12908 ~ 7% TOTAL interrupts.RES

Looks like a degradation. Caused by the increase in RCU softirqs?

> 490 ~ 1% -37.3% 307 ~ 1% TOTAL vmstat.system.cs
> 1639 ~ 0% -8.8% 1495 ~ 0% TOTAL vmstat.system.in
> 819681 ~ 0% -3.7% 789527 ~ 0% TOTAL interrupts.LOC

These all look like improvements.

> > > Legend:
> > > ~XX% - stddev percent
> > > [+-]XX% - change percent
> > >
> > >
> > > time.involuntary_context_switches
> > >
> > > 3500 ++------------------------------------------------------------------+
> > > | .*.. |
> > > 3000 ++ .*. *..*.. .*..*.. .*.. |
> > > *..*..*..*. *. * |
> > > | *..*.. .*.. .*..* |
> > > 2500 ++ *..*. *..*. |
> > > | |
> > > 2000 ++ |
> > > | |
> > > 1500 ++ |
> > > | |
> > > | |
> > > 1000 ++ |
> > > | O O O O O O O O O
> > > 500 O+-O-----------O--O--O--O--------O-O--O--O--O--O--------O-----O--O--+
> > >
> > >
> > > [*] bisect-good sample
> > > [O] bisect-bad sample
> >
> > So the good case increases involuntary context switches, but helps something
> > else? Or does the benefit stem from increased involuntary context switches
> > and thus less time spinning or some such?
>
> In bisect POV, branch BASE is good and HEAD is bad. Which has nothing
> to do with the improvement/regression in performance POV.
>
> Here the HEAD(bisect bad) commit has less involuntary_context_switches
> which indicates an improvement over BASE. It does look like close to
> the root cause of improved will-it-scale throughput.

Ah, that explains it!

Would it make sense to add "(HEAD)" to the legend for "[O]"?

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Andrew Lutomirski: "Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*"
Previous message: H. Peter Anvin: "Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*"
In reply to: Fengguang Wu: "Re: [sched,rcu] b84c4e08143: +3.1% will-it-scale.per_thread_ops"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]