RE: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations

From: Seger, Mark
Date: Fri Oct 14 2011 - 09:36:49 EST


> > The HP developed collectl utility depends heavily on abstractions in
> /proc and /sys provided by the kernel for kernel statistics capture.
> > During testing on newer NUMA platforms with large numbers of CPU's HP
> has observed a significant increase in the time taken to parse this
> data with the newer 2.6.32.x and higher kernels.

While this does indeed affect collectl, rendering it ineffective at monitoring intervals of 0.1 seconds or lower, it affects ANY utility that reads /proc/stat, and the more frequent the access, the heavier the load. Remember, the increase in load is by a factor of over 50, so those who like to run top or vmstat or even iostat will also see a much heavier hand. If people ran sar at a lower monitoring interval it too would suffer.

I've also observed that reading /proc/meminfo consumes a lot more CPU as well, though not as much as /proc/stat.

> Strange, here is what I have on my 2x4x2 machine.

Forgive a dumb question, but what does 2X4X2 mean? ;(

I've observed things are fine on 2 socket systems and are truly horrible on the few 8 socket machines I tested this on, which also had higher core counts, so there were always on the order of 48+ CPUs involved. I haven't done any testing on 4 socket machines, and I don't even want to think about what could happen on 16 or 32 or however many socket machines are built in the future.

> Could you try following actions please ?
>
> # perf record ./prog
> Opened, read and closed 8640 times and read total of 29595948 bytes
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.057 MB perf.data (~2482 samples) ]
>
> # perf report --stdio
> # Events: 1K cycles
> #
> # Overhead Command Shared Object Symbol
> # ........ ....... ................. ............................
> #
> 22.82% prog [kernel.kallsyms] [k] format_decode
> 21.15% prog [kernel.kallsyms] [k] vsnprintf
> 13.46% prog [kernel.kallsyms] [k] number.isra.4
> 11.33% prog [kernel.kallsyms] [k] memcpy
> 8.87% prog [kernel.kallsyms] [k] kstat_irqs
> 3.80% prog [kernel.kallsyms] [k] radix_tree_lookup_element
> 3.69% prog [kernel.kallsyms] [k] seq_printf
> 3.63% prog [kernel.kallsyms] [k] find_next_bit
> 2.81% prog [kernel.kallsyms] [k] show_stat
> 1.18% prog [kernel.kallsyms] [k] put_dec_trunc
> 0.96% prog [kernel.kallsyms] [k] radix_tree_lookup
> 0.96% prog [kernel.kallsyms] [k] irq_to_desc
> 0.66% prog [kernel.kallsyms] [k] getname_flags
> 0.37% prog [kernel.kallsyms] [k] selinux_inode_alloc_security
> 0.22% prog [kernel.kallsyms] [k] jiffies_64_to_clock_t
> 0.22% prog [kernel.kallsyms] [k] copy_user_generic_string
> 0.22% prog [kernel.kallsyms] [k] __call_rcu
> 0.15% prog [kernel.kallsyms] [k] add_preempt_count
> 0.15% prog [kernel.kallsyms] [k] put_dec_full
> 0.15% prog [kernel.kallsyms] [k] nr_context_switches
> 0.15% prog [kernel.kallsyms] [k] kfree

For me the easiest reproducer, which admittedly doesn't offer any deeper analysis, is to just run "cat /proc/stat > /dev/null" in a loop and time it.

-mark