Re: [PATCH v1 00/19] Increase resolution of load weights

From: Nikhil Rao
Date: Thu May 12 2011 - 04:57:12 EST


On Tue, May 10, 2011 at 11:59 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Nikhil Rao <ncrao@xxxxxxxxxx> wrote:
>
>> > Also, the above (and the other scale-adjustment changes) probably explains
>> > why the instruction count went up on 64-bit.
>>
>> Yes, that makes sense. We see an increase in instruction count of about 2%
>> with the new version of the patchset, down from 5.8% (will post the new
>> patchset soon). Assuming 30% of the cost of pipe test is scheduling, that is
>> an effective increase of approx. 6.7%. I'll post the data and some analysis
>> along with the new version.
>
> An instruction count increase does not necessarily mean a linear slowdown: if
> those instructions are cheaper or scheduled better by the CPU then often the
> slowdown will be less.
>
> Sometimes a 1% increase in the instruction count can slow down a workload by
> 5%, if the 1% increase does divisions, has complex data path dependencies or is
> missing the branch-cache a lot.
>
> So you should keep an eye on the cycle count as well. Latest -tip's perf stat
> can also measure 'stalled cycles':
>
> aldebaran:~/sched-tests> taskset 1 perf stat --repeat 3 ./pipe-test-1m
>
>  Performance counter stats for './pipe-test-1m' (3 runs):
>
>      6499.787926 task-clock               #    0.437 CPUs utilized            ( +-  0.41% )
>        2,000,108 context-switches         #    0.308 M/sec                    ( +-  0.00% )
>                0 CPU-migrations           #    0.000 M/sec                    ( +-100.00% )
>              147 page-faults              #    0.000 M/sec                    ( +-  0.00% )
>   14,226,565,939 cycles                   #    2.189 GHz                      ( +-  0.49% )
>    6,897,331,129 stalled-cycles-frontend  #   48.48% frontend cycles idle     ( +-  0.90% )
>    4,230,895,459 stalled-cycles-backend   #   29.74% backend  cycles idle     ( +-  1.31% )
>   14,002,256,109 instructions             #    0.98  insns per cycle
>                                           #    0.49  stalled cycles per insn  ( +-  0.02% )
>    2,703,891,945 branches                 #  415.997 M/sec                    ( +-  0.02% )
>       44,994,805 branch-misses            #    1.66% of all branches          ( +-  0.27% )
>
>       14.859234036  seconds time elapsed  ( +-  0.19% )
>
> The stalled-cycles frontend/backend metrics indicate whether a workload utilizes
> the CPU's resources optimally. Looking at a 'perf record -e
> stalled-cycles-frontend' and 'perf report' will show you the problem areas.
>
> Most of the 'problem areas' will be unrelated to your code.
>
> A 'near perfectly utilized' CPU looks like this:
>
> aldebaran:~/opt> taskset 1 perf stat --repeat 10 ./fill_1b
>
>  Performance counter stats for './fill_1b' (10 runs):
>
>      1880.489837 task-clock               #    0.998 CPUs utilized            ( +-  0.15% )
>               36 context-switches         #    0.000 M/sec                    ( +- 19.87% )
>                1 CPU-migrations           #    0.000 M/sec                    ( +- 59.63% )
>               99 page-faults              #    0.000 M/sec                    ( +-  0.10% )
>    6,027,432,226 cycles                   #    3.205 GHz                      ( +-  0.15% )
>       22,138,455 stalled-cycles-frontend  #    0.37% frontend cycles idle     ( +- 36.56% )
>       16,400,224 stalled-cycles-backend   #    0.27% backend  cycles idle     ( +- 38.12% )
>   18,008,803,113 instructions             #    2.99  insns per cycle
>                                           #    0.00  stalled cycles per insn  ( +-  0.00% )
>    1,001,802,536 branches                 #  532.735 M/sec                    ( +-  0.01% )
>           22,842 branch-misses            #    0.00% of all branches          ( +-  9.07% )
>
>        1.884595529  seconds time elapsed  ( +-  0.15% )
>
> Both stall counts are very low. This is pretty hard to achieve in general, so
> before/after comparisons are used. For that there's 'perf diff' which you can
> use to compare before/after profiles:
>
>  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
>  [ perf record: Woken up 2 times to write data ]
>  [ perf record: Captured and wrote 0.427 MB perf.data (~18677 samples) ]
>  aldebaran:~/sched-tests> taskset 1 perf record -e instructions ./pipe-test-1m
>  [ perf record: Woken up 2 times to write data ]
>  [ perf record: Captured and wrote 0.428 MB perf.data (~18685 samples) ]
>  aldebaran:~/sched-tests> perf diff | head -10
>  # Baseline   Delta       Shared Object                        Symbol
>  # ........  ..........  .................  .............................
>  #
>      2.68%      +0.84%  [kernel.kallsyms]  [k] select_task_rq_fair
>      3.28%      -0.17%  [kernel.kallsyms]  [k] fsnotify
>      2.67%      +0.13%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>      2.46%      +0.11%  [kernel.kallsyms]  [k] pipe_read
>      2.42%              [kernel.kallsyms]  [k] schedule
>      2.11%      +0.28%  [kernel.kallsyms]  [k] copy_user_generic_string
>      2.13%      +0.18%  [kernel.kallsyms]  [k] mutex_lock
>
>  ( Note: these were two short runs on the same kernel so the diff shows the
>    natural noise of the profile of this workload. Longer runs are needed to
>    measure effects smaller than 1%. )
>
> So there's a wide range of tools you can use to understand the precise
> performance impact of your patch and in turn you can present to us what you
> learned about it.
>
> Such analysis saves quite a bit of time on the side of us scheduler maintainers
> and makes performance impacting patches a lot more easy to apply :-)
>

Thanks for the info! I rebased the patchset against -tip and built
perf from -tip. Here are the results from running pipe-test-100k bound
to a single cpu with 100 repetitions.
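For reference, each run was driven with something like the following, modeled
on the example above (the exact event list is approximate, chosen to match the
counters reported below):

# taskset 1 perf stat --repeat 100 \
      -e instructions,cycles,stalled-cycles-backend,stalled-cycles-frontend \
      /root/data/pipe-test-100k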

-tip (baseline):

Performance counter stats for '/root/data/pipe-test-100k' (100 runs):

       907,981,999 instructions             #    0.85  insns per cycle
                                            #    0.34  stalled cycles per insn  ( +-  0.07% )
     1,072,650,809 cycles                   #    0.000 GHz                      ( +-  0.13% )
       305,678,413 stalled-cycles-backend   #   28.50% backend cycles idle      ( +-  0.51% )
       245,846,208 stalled-cycles-frontend  #   22.92% frontend cycles idle     ( +-  0.70% )

       1.060303165 seconds time elapsed  ( +-  0.09% )


-tip+patches:

Performance counter stats for '/root/data/pipe-test-100k' (100 runs):

       910,501,358 instructions             #    0.82  insns per cycle
                                            #    0.36  stalled cycles per insn  ( +-  0.06% )
     1,108,981,763 cycles                   #    0.000 GHz                      ( +-  0.17% )
       328,816,295 stalled-cycles-backend   #   29.65% backend cycles idle      ( +-  0.63% )
       247,412,614 stalled-cycles-frontend  #   22.31% frontend cycles idle     ( +-  0.71% )

       1.075497493 seconds time elapsed  ( +-  0.10% )


From this latest run on -tip, the instruction count is about 0.28% higher and
the cycle count about 3.38% higher. From the stalled-cycles counts, it looks
like most of the increase comes from backend stalls. It's not clear exactly
what type of stalls these are, but my guess is that they are stalls past the
decode stage (i.e. functional units, load/store, etc.). Is that right?
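To narrow down where the backend stalls come from, the next step is to profile
that event directly, mirroring the frontend example above (assuming perf record
accepts the stalled-cycles-backend alias the same way):

# taskset 1 perf record -e stalled-cycles-backend /root/data/pipe-test-100k
# perf report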

I collected profiles from long runs of pipe-test (about 3m iterations) and
tried running "perf diff" on them. I added the buildids of the two kernel
images and the associated test binary & libraries to the buildid cache. The
individual reports make sense, but I suspect something is wrong with the diff
output.

# perf buildid-cache -v -a boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV
Adding 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1 boot.tip-patches/vmlinux-2.6.39-tip-smp-DEV: Ok
# perf buildid-cache -v -a boot.tip/vmlinux-2.6.39-tip-smp-DEV
Adding 47737eb3efdd6cb789872311c354b106ec8e7477 p/boot.tip/vmlinux-2.6.39-tip-smp-DEV: Ok

# perf buildid-list -i perf.data | grep kernel
17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1 [kernel.kallsyms]

# perf buildid-list -i perf.data.old | grep kernel
47737eb3efdd6cb789872311c354b106ec8e7477 [kernel.kallsyms]

# perf report -i perf.data.old -d [kernel.kallsyms] | head -n 10
# dso: [kernel.kallsyms]
# Events: 30K instructions
#
# Overhead       Command                       Symbol
# ........  ............  ...........................
#
     5.55%  pipe-test-3m  [k] pipe_read
     4.78%  pipe-test-3m  [k] schedule
     3.68%  pipe-test-3m  [k] update_curr
     3.52%  pipe-test-3m  [k] pipe_write


# perf report -i perf.data -d [kernel.kallsyms] | head -n 10
# dso: [kernel.kallsyms]
# Events: 31K instructions
#
# Overhead       Command                                 Symbol
# ........  ............  .....................................
#
     6.09%  pipe-test-3m  [k] pipe_read
     4.86%  pipe-test-3m  [k] schedule
     4.24%  pipe-test-3m  [k] update_curr
     3.87%  pipe-test-3m  [k] find_next_bit


# perf diff -v -d [kernel.kallsyms]
build id event received for [kernel.kallsyms]: 47737eb3efdd6cb789872311c354b106ec8e7477
...
build id event received for [kernel.kallsyms]: 17b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1
...
Looking at the vmlinux_path (6 entries long)
Using /tmp/.debug/.build-id/47/737eb3efdd6cb789872311c354b106ec8e7477 for symbols
Looking at the vmlinux_path (6 entries long)
Using /tmp/.debug/.build-id/17/b6f2c42deb3725ad35e3dcba2d9fdb92ad47c1 for symbols
# Baseline       Delta                                 Symbol
# ........  ..........  .....................................
#
     0.00%      +6.09%  0xffffffff8112a258 ! [k] pipe_read
     0.00%      +4.86%  0xffffffff8141a206 ! [k] schedule
     0.00%      +4.24%  0xffffffff810634d8 ! [k] update_curr
     0.00%      +3.87%  0xffffffff8121f569 ! [k] find_next_bit
     0.00%      +3.33%  0xffffffff81065cbf ! [k] enqueue_task_fair
     0.00%      +3.25%  0xffffffff81065824 ! [k] dequeue_task_fair
     0.00%      +2.77%  0xffffffff81129d10 ! [k] pipe_write
     0.00%      +2.71%  0xffffffff8114ed97 ! [k] fsnotify

The baseline numbers are showing up as zero and the deltas match the
fractions from the -tip+patches report. Am I missing something here?
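One more thing I can try is naming the two data files explicitly rather than
relying on the implicit perf.data.old/perf.data pairing, in case that is part
of the problem (assuming perf diff accepts the old and new files as positional
arguments):

# perf diff -d [kernel.kallsyms] perf.data.old perf.data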

Another thing I noticed while running this is that low-weight task groups are
balanced much more poorly on -tip than on v2.6.39-rc7. I started bisecting
between v2.6.39-rc7 and -tip to identify the source of this regression.
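The bisect itself is the standard procedure, roughly as follows (assuming -tip
is fetched as a remote named 'tip'):

# git bisect start
# git bisect bad tip/master
# git bisect good v2.6.39-rc7
# ... then build and boot each candidate kernel, rerun the low-weight
#     group experiment, and mark it 'git bisect good' or 'git bisect bad'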

[ experiment: create a low-weight task group (setup sketched below) and run
  ~50 threads with a random sleep/busy pattern ]
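For reference, the group setup is roughly the following; the /dev/cgroup mount
point and the shares value are specific to my setup:

# mount -t cgroup -o cpu none /dev/cgroup      # if not already mounted
# mkdir /dev/cgroup/low-group
# echo 2 > /dev/cgroup/low-group/cpu.shares    # very low group weight
# echo $$ > /dev/cgroup/low-group/tasks        # move this shell into the group,
                                               # then spawn the ~50 workers from it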

-tip:

01:30:03 PM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %idle    intr/s
01:30:04 PM  all  90.67   0.00  0.00     0.00  0.00   0.00    0.00    0.00   9.33  15368.00
01:30:05 PM  all  93.08   0.00  0.00     0.00  0.00   0.00    0.00    0.00   6.92  15690.00
01:30:06 PM  all  94.56   0.00  0.00     0.00  0.00   0.00    0.00    0.00   5.44  15844.00
01:30:07 PM  all  94.88   0.00  0.06     0.00  0.00   0.00    0.00    0.00   5.06  15989.00
01:30:08 PM  all  94.31   0.00  0.00     0.00  0.00   0.00    0.00    0.00   5.69  15791.00
01:30:09 PM  all  95.00   0.00  0.00     0.00  0.00   0.00    0.00    0.00   5.00  15953.00
01:30:10 PM  all  94.19   0.00  0.06     0.00  0.00   0.00    0.00    0.00   5.75  15810.00
01:30:11 PM  all  93.75   0.00  0.00     0.00  0.00   0.00    0.00    0.00   6.25  15748.00
01:30:12 PM  all  94.94   0.00  0.06     0.00  0.00   0.00    0.00    0.00   5.00  15943.00
01:30:13 PM  all  94.31   0.00  0.00     0.00  0.00   0.00    0.00    0.00   5.69  15865.00
Average:     all  93.97   0.00  0.02     0.00  0.00   0.00    0.00    0.00   6.01  15800.10

-tip+patches:

01:29:59 PM  CPU    %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %idle    intr/s
01:30:00 PM  all   99.31   0.00  0.56     0.00  0.00   0.00    0.00    0.00   0.12  16908.00
01:30:01 PM  all   99.44   0.00  0.50     0.00  0.00   0.06    0.00    0.00   0.00  18128.00
01:30:02 PM  all   99.88   0.00  0.00     0.00  0.00   0.00    0.00    0.00   0.12  16582.00
01:30:03 PM  all   99.06   0.00  0.75     0.00  0.00   0.00    0.00    0.00   0.19  17108.00
01:30:04 PM  all   99.94   0.00  0.06     0.00  0.00   0.00    0.00    0.00   0.00  17113.00
01:30:05 PM  all  100.00   0.00  0.00     0.00  0.00   0.00    0.00    0.00   0.00  16568.00
01:30:06 PM  all   99.81   0.00  0.00     0.00  0.00   0.00    0.00    0.00   0.19  16408.91
01:30:07 PM  all   99.87   0.00  0.00     0.00  0.00   0.00    0.00    0.00   0.13  16576.00
01:30:08 PM  all   99.94   0.00  0.00     0.00  0.00   0.00    0.00    0.00   0.06  16617.00
01:30:09 PM  all   99.94   0.00  0.00     0.00  0.00   0.06    0.00    0.00   0.00  16702.00
Average:     all   99.72   0.00  0.19     0.00  0.00   0.01    0.00    0.00   0.08  16870.63

-Thanks,
Nikhil