Re: [PATCH 1/2] x86: separating entry text section

From: Jiri Olsa
Date: Mon Mar 07 2011 - 05:45:04 EST


hi,
any feedback?

thanks,
jirka

On Tue, Feb 22, 2011 at 01:52:01PM +0100, Jiri Olsa wrote:
> On Tue, Feb 22, 2011 at 09:09:34AM +0100, Ingo Molnar wrote:
> >
> > * Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >
> > > Putting x86 entry code into a separate section: .entry.text.
> >
> > Trying to apply your patch i noticed one detail:
> >
> > > before patch:
> > > 26282174 L1-icache-load-misses ( +- 0.099% ) (scaled from 81.00%)
> > > 0.206651959 seconds time elapsed ( +- 0.152% )
> > >
> > > after patch:
> > > 24237651 L1-icache-load-misses ( +- 0.117% ) (scaled from 80.96%)
> > > 0.210509948 seconds time elapsed ( +- 0.140% )
> >
> > So time elapsed actually went up.
> >
> > hackbench is notoriously unstable when it comes to runtime - and increasing the
> > --repeat value only has limited effects on that.
> >
> > Dropping all system caches:
> >
> > echo 1 > /proc/sys/vm/drop_caches
> >
> > Seems to do a better job of 'resetting' system state, but if we put that into the
> > measured workload then the results are all over the place (as we now depend on IO
> > being done):
> >
> > # cat hb10
> >
> > echo 1 > /proc/sys/vm/drop_caches
> > ./hackbench 10
> >
> > # perf stat --repeat 3 ./hb10
> >
> > Time: 0.097
> > Time: 0.095
> > Time: 0.101
> >
> > Performance counter stats for './hb10' (3 runs):
> >
> > 21.351257 task-clock-msecs # 0.044 CPUs ( +- 27.165% )
> > 6 context-switches # 0.000 M/sec ( +- 34.694% )
> > 1 CPU-migrations # 0.000 M/sec ( +- 25.000% )
> > 410 page-faults # 0.019 M/sec ( +- 0.081% )
> > 25,407,650 cycles # 1189.984 M/sec ( +- 49.154% )
> > 25,407,650 instructions # 1.000 IPC ( +- 49.154% )
> > 5,126,580 branches # 240.107 M/sec ( +- 46.012% )
> > 192,272 branch-misses # 3.750 % ( +- 44.911% )
> > 901,701 cache-references # 42.232 M/sec ( +- 12.857% )
> > 802,767 cache-misses # 37.598 M/sec ( +- 9.282% )
> >
> > 0.483297792 seconds time elapsed ( +- 31.152% )
> >
> > So here's a perf stat feature suggestion to solve such measurement problems: a new
> > 'pre-run' 'dry' command could be specified that is executed before the real 'hot'
> > run. Something like this:
> >
> > perf stat --pre-run-script ./hb10 --repeat 10 ./hackbench 10
> >
> > This would do the cache-clearing before each run: it would run hackbench once
> > (a dry run) and then run 'hackbench 10' for real - and would repeat the whole
> > thing 10 times. Only the 'hot' portion of each run would be measured and
> > displayed in the perf stat output event counts.
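
[ Until such an option exists, a rough manual approximation - just a sketch,
  run as root, assuming hackbench sits in the current directory; only the hot
  run is wrapped in perf stat, and the 10 per-run outputs still have to be
  averaged by hand:

  for i in $(seq 1 10); do
          echo 1 > /proc/sys/vm/drop_caches     # reset system state, unmeasured
          ./hackbench 10 > /dev/null            # dry warm-up run, unmeasured
          perf stat -e L1-icache-load-misses -e instructions -e cycles \
                  ./hackbench 10                # the measured 'hot' run
  done
]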
> >
> > Another observation:
> >
> > > 24237651 L1-icache-load-misses ( +- 0.117% ) (scaled from 80.96%)
> >
> > Could you please do runs that do not display 'scaled from' messages? Since we are
> > measuring a relatively small effect here, and scaling adds noise, it would be nice
> > to ensure that the effect persists with non-scaled events as well:
> >
> > You can do that by reducing the number of events that are measured. The PMU cannot
> > measure all those L1 cache events you listed at once - so only use the most
> > important one and add cycles and instructions to make sure the measurements are
> > comparable:
> >
> > -e L1-icache-load-misses -e instructions -e cycles
> >
> > Btw., there's another 'perf stat' feature suggestion: it would be nice if it were
> > possible to 'record' a perf stat run and do a 'perf diff' over it. That would
> > compare the two runs automatically, without you having to do the comparison
> > manually.
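
[ A rough manual equivalent of that today - a sketch only, the file names are
  made up: perf stat prints the counts on stderr, so they can be captured per
  kernel and compared by hand:

  perf stat --repeat 100 -e L1-icache-load-misses -e instructions -e cycles \
          ./hackbench 10 2> stat-before.txt
  # ... reboot into the patched kernel and repeat ...
  perf stat --repeat 100 -e L1-icache-load-misses -e instructions -e cycles \
          ./hackbench 10 2> stat-after.txt
  diff -u stat-before.txt stat-after.txt
]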
>
> hi,
>
> I made another test with "resetting" the system state as suggested,
> measuring only the L1-icache-load-misses event together with the
> instructions and cycles events.
>
> I can see an even bigger drop in icache load misses than before,
> from 19359739 to 16448709 (about 15%).
>
> The instruction and cycle counts are slightly higher in the patched
> kernel run, though.
>
> perf stat --repeat 100 -e L1-icache-load-misses -e instructions -e cycles ./hackbench/hackbench 10
>
> -------------------------------------------------------------------------------
> before patch:
>
> Performance counter stats for './hackbench/hackbench 10' (100 runs):
>
> 19359739 L1-icache-load-misses ( +- 0.313% )
> 2667528936 instructions # 0.498 IPC ( +- 0.165% )
> 5352849800 cycles ( +- 0.303% )
>
> 0.205402048 seconds time elapsed ( +- 0.299% )
>
> Performance counter stats for './hackbench/hackbench 10' (500 runs):
>
> 19417627 L1-icache-load-misses ( +- 0.147% )
> 2676914223 instructions # 0.497 IPC ( +- 0.079% )
> 5389516026 cycles ( +- 0.144% )
>
> 0.206267711 seconds time elapsed ( +- 0.138% )
>
>
> -------------------------------------------------------------------------------
> after patch:
>
> Performance counter stats for './hackbench/hackbench 10' (100 runs):
>
> 16448709 L1-icache-load-misses ( +- 0.426% )
> 2698406306 instructions # 0.500 IPC ( +- 0.177% )
> 5393976267 cycles ( +- 0.321% )
>
> 0.206072845 seconds time elapsed ( +- 0.276% )
>
> Performance counter stats for './hackbench/hackbench 10' (500 runs):
>
> 16490788 L1-icache-load-misses ( +- 0.180% )
> 2717734941 instructions # 0.502 IPC ( +- 0.079% )
> 5414756975 cycles ( +- 0.148% )
>
> 0.206747566 seconds time elapsed ( +- 0.137% )
>
>
> Attaching the patch with the above numbers in the changelog.
>
> thanks,
> jirka
>
>
> ---
> Putting x86 entry code into a separate section: .entry.text.
>
> Separating the entry text section seems to have performance
> benefits with regard to instruction cache usage.
>
> Running hackbench showed that the change compresses the icache
> footprint. The icache load miss count went down by about 15%:
>
> before patch:
> 19417627 L1-icache-load-misses ( +- 0.147% )
>
> after patch:
> 16490788 L1-icache-load-misses ( +- 0.180% )
>
>
> Whole perf output follows.
>
> - results for current tip tree:
> Performance counter stats for './hackbench/hackbench 10' (500 runs):
>
> 19417627 L1-icache-load-misses ( +- 0.147% )
> 2676914223 instructions # 0.497 IPC ( +- 0.079% )
> 5389516026 cycles ( +- 0.144% )
>
> 0.206267711 seconds time elapsed ( +- 0.138% )
>
> - results for current tip tree with the patch applied are:
> Performance counter stats for './hackbench/hackbench 10' (500 runs):
>
> 16490788 L1-icache-load-misses ( +- 0.180% )
> 2717734941 instructions # 0.502 IPC ( +- 0.079% )
> 5414756975 cycles ( +- 0.148% )
>
> 0.206747566 seconds time elapsed ( +- 0.137% )
>
>
> wbr,
> jirka
>
>
> Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> ---
> arch/x86/ia32/ia32entry.S | 2 ++
> arch/x86/kernel/entry_32.S | 6 ++++--
> arch/x86/kernel/entry_64.S | 6 ++++--
> arch/x86/kernel/vmlinux.lds.S | 1 +
> include/asm-generic/sections.h | 1 +
> include/asm-generic/vmlinux.lds.h | 6 ++++++
> 6 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 0ed7896..50f1630 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -25,6 +25,8 @@
> #define sysretl_audit ia32_ret_from_sys_call
> #endif
>
> + .section .entry.text, "ax"
> +
> #define IA32_NR_syscalls ((ia32_syscall_end - ia32_sys_call_table)/8)
>
> .macro IA32_ARG_FIXUP noebp=0
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index c8b4efa..f5accf8 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -65,6 +65,8 @@
> #define sysexit_audit syscall_exit_work
> #endif
>
> + .section .entry.text, "ax"
> +
> /*
> * We use macros for low-level operations which need to be overridden
> * for paravirtualization. The following will never clobber any registers:
> @@ -788,7 +790,7 @@ ENDPROC(ptregs_clone)
> */
> .section .init.rodata,"a"
> ENTRY(interrupt)
> -.text
> +.section .entry.text, "ax"
> .p2align 5
> .p2align CONFIG_X86_L1_CACHE_SHIFT
> ENTRY(irq_entries_start)
> @@ -807,7 +809,7 @@ vector=FIRST_EXTERNAL_VECTOR
> .endif
> .previous
> .long 1b
> - .text
> + .section .entry.text, "ax"
> vector=vector+1
> .endif
> .endr
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 891268c..39f8d21 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -61,6 +61,8 @@
> #define __AUDIT_ARCH_LE 0x40000000
>
> .code64
> + .section .entry.text, "ax"
> +
> #ifdef CONFIG_FUNCTION_TRACER
> #ifdef CONFIG_DYNAMIC_FTRACE
> ENTRY(mcount)
> @@ -744,7 +746,7 @@ END(stub_rt_sigreturn)
> */
> .section .init.rodata,"a"
> ENTRY(interrupt)
> - .text
> + .section .entry.text
> .p2align 5
> .p2align CONFIG_X86_L1_CACHE_SHIFT
> ENTRY(irq_entries_start)
> @@ -763,7 +765,7 @@ vector=FIRST_EXTERNAL_VECTOR
> .endif
> .previous
> .quad 1b
> - .text
> + .section .entry.text
> vector=vector+1
> .endif
> .endr
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index e70cc3d..459dce2 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -105,6 +105,7 @@ SECTIONS
> SCHED_TEXT
> LOCK_TEXT
> KPROBES_TEXT
> + ENTRY_TEXT
> IRQENTRY_TEXT
> *(.fixup)
> *(.gnu.warning)
> diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
> index b3bfabc..c1a1216 100644
> --- a/include/asm-generic/sections.h
> +++ b/include/asm-generic/sections.h
> @@ -11,6 +11,7 @@ extern char _sinittext[], _einittext[];
> extern char _end[];
> extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
> extern char __kprobes_text_start[], __kprobes_text_end[];
> +extern char __entry_text_start[], __entry_text_end[];
> extern char __initdata_begin[], __initdata_end[];
> extern char __start_rodata[], __end_rodata[];
>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index fe77e33..906c3ce 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -424,6 +424,12 @@
> *(.kprobes.text) \
> VMLINUX_SYMBOL(__kprobes_text_end) = .;
>
> +#define ENTRY_TEXT \
> + ALIGN_FUNCTION(); \
> + VMLINUX_SYMBOL(__entry_text_start) = .; \
> + *(.entry.text) \
> + VMLINUX_SYMBOL(__entry_text_end) = .;
> +
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> #define IRQENTRY_TEXT \
> ALIGN_FUNCTION(); \
> --
> 1.7.1
>
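[ A quick sanity check on a kernel built with this patch - a sketch, not part
  of the patch itself: the new section bounds added to asm-generic/sections.h
  should be visible as symbols in the final image:

  nm vmlinux | grep -e __entry_text_start -e __entry_text_end
]
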
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/