Re: Regression in reading /proc/stat in the newer kernels with large SMP and NUMA configurations

From: David Rientjes
Date: Fri Oct 14 2011 - 06:08:30 EST


On Fri, 14 Oct 2011, Oberman, Laurence (HAS GSE) wrote:

> On the 2.6.16.x series SLES or 2.6.18.x RHEL kernels, each read takes
> 84 us.
>
> dl785sles:~/wrk # cat 2.6.16.60-0.87.1-smp.log
> Opened, read and closed 8640 times and read total of 13996800 bytes
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.41    0.725717          84      8641           read
>   0.38    0.002768           0      8642           open
>   0.21    0.001539           0      8642           close
>   0.00    0.000000           0         1           write
>   0.00    0.000000           0         3           fstat
>   0.00    0.000000           0         8           mmap
>   0.00    0.000000           0         2           mprotect
>   0.00    0.000000           0         1           munmap
>   0.00    0.000000           0         1           brk
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           madvise
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.730024                 25945         1 total
>
> On the 2.6.38.4 kernel, each read takes more than 6 ms.
>
> dl785sles:~/wrk # cat 2.6.38.4-smp.log
> Opened, read and closed 8640 times and read total of 86235840 bytes
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   59.021650        6830      8641           read
>   0.00    0.000520           0      8642           open
>   0.00    0.000425           0      8642           close
>   0.00    0.000000           0         1           write
>   0.00    0.000000           0         3           fstat
>   0.00    0.000000           0         9           mmap
>   0.00    0.000000           0         2           mprotect
>   0.00    0.000000           0         1           munmap
>   0.00    0.000000           0         1           brk
>   0.00    0.000000           0         1         1 access
>   0.00    0.000000           0         1           madvise
>   0.00    0.000000           0         1           execve
>   0.00    0.000000           0         1           arch_prctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00   59.022595                 25946         1 total
>
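
As a cross-check of the quoted numbers, the sketch below redoes the
division from the two strace -c summaries. The figures are copied from
the logs above; the dictionary layout and the name per_read_stats are
made up for this illustration. Besides the time per read, it gives the
number of bytes returned per open/read/close pass (1620 on the old
kernel, 9981 on the new one). That growth would be consistent with many
more per-irq columns being printed, though the logs alone do not prove
it.

```python
# Figures copied from the two strace -c summaries quoted above.
RUNS = {
    "2.6.16.60-0.87.1-smp": {
        "read_seconds": 0.725717,
        "read_calls": 8641,
        "iterations": 8640,
        "total_bytes": 13996800,
    },
    "2.6.38.4-smp": {
        "read_seconds": 59.021650,
        "read_calls": 8641,
        "iterations": 8640,
        "total_bytes": 86235840,
    },
}


def per_read_stats(run):
    """Return (microseconds per read call, bytes per open/read/close pass)."""
    usecs_per_read = run["read_seconds"] / run["read_calls"] * 1e6
    bytes_per_pass = run["total_bytes"] / run["iterations"]
    return usecs_per_read, bytes_per_pass


if __name__ == "__main__":
    for name, run in RUNS.items():
        usecs, nbytes = per_read_stats(run)
        print(f"{name}: {usecs:.0f} us/read, {nbytes:.0f} bytes per pass")
```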

The overhead is probably in kstat_irqs_cpu(), which is called for each
possible irq on each of the 32 possible cpus, and /proc/stat actually
does that sum twice. You would see the same kind of overhead with
/proc/interrupts if it weren't masked by the locking required to
safely read irq_desc. "dmesg | grep nr_irqs" will show how many per-cpu
counters are read, twice, for every cpu.
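
To put a rough number on that, here is a minimal sketch, not the kernel
code. It assumes the documented /proc/stat layout, in which the "intr"
line carries a grand total followed by one counter per irq, and it takes
the 32 cpus and the two passes from the description above. The function
names and the sample text in the usage note are hypothetical.

```python
def count_irq_columns(stat_text):
    """Count the per-irq counters on the 'intr' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "intr":
            # fields[0] is the label and fields[1] the grand total;
            # every remaining field is the counter for one irq.
            return len(fields) - 2
    return 0


def estimated_percpu_reads(nr_irqs, nr_cpus, passes=2):
    """Per-cpu counter reads for one /proc/stat read, if every irq is
    summed over every cpu on each pass (the model described above)."""
    return passes * nr_irqs * nr_cpus


if __name__ == "__main__":
    try:
        with open("/proc/stat") as f:
            irqs = count_irq_columns(f.read())
    except OSError:
        irqs = 0  # not on Linux, or /proc is unavailable
    # 32 is the cpu count mentioned in this thread, not a detected value.
    print(irqs, "irq columns ->", estimated_percpu_reads(irqs, 32),
          "per-cpu reads per /proc/stat read")
```

For a made-up input such as "intr 100 10 20 30 40", count_irq_columns()
returns 4, and estimated_percpu_reads(4, 32) returns 256. On a real
system the column count should be close to the nr_irqs value that dmesg
reports.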