Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes

From: David Ahern
Date: Thu Aug 13 2020 - 00:47:41 EST


On 8/12/20 1:51 AM, Andrei Vagin wrote:
>
> I rebased the task_diag patches on top of v5.8:
> https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag

Thanks for updating the patches.

>
> /proc/pid files have three major limitations:
> * Requires at least three syscalls per process per file
> open(), read(), close()
> * Variety of formats, mostly text based
> The kernel spends time encoding binary data into a text format, and
> then tools like top and ps spend time decoding it back into a binary
> format.
> * Sometimes slow due to extra attributes
> For example, /proc/PID/smaps contains a lot of useful information
> about memory mappings and the memory consumption of each of them. But
> even if we don't need the memory consumption fields, the kernel still
> spends time collecting this information.

that's what I recall as well.
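
To make the first limitation concrete, here is a minimal sketch of the per-process pattern a tool has to follow today. The function name and the syscall counter are only illustrative (they are not taken from task_proc_all.c), and the counter tallies only calls made on status files that opened successfully:

```python
import os


def read_status_for_all(proc_root="/proc"):
    """Read <proc_root>/<pid>/status for every numeric directory entry.

    Returns (contents_by_pid, syscall_count). The count covers only the
    open/read/close calls issued on the status files themselves.
    """
    contents = {}
    syscalls = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        path = os.path.join(proc_root, entry, "status")
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # task exited between listdir() and open()
        syscalls += 1
        try:
            chunks = []
            while True:
                chunk = os.read(fd, 4096)
                syscalls += 1
                if not chunk:
                    break  # an empty read signals end of file
                chunks.append(chunk)
            contents[int(entry)] = b"".join(chunks)
        except OSError:
            pass  # task exited mid-read; skip it
        finally:
            os.close(fd)
            syscalls += 1
    return contents, syscalls
```

Reading until end of file costs one extra read() per file, so in this sketch each task costs four syscalls rather than the minimum of three, and that is for a single file per task.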

>
> More details and numbers are in this article:
> https://avagin.github.io/how-fast-is-procfs
>
> This new interface removes only one of these limitations, whereas
> task_diag removes all of them.
>
> I also compared how fast each of these interfaces is:
>
> The test environment:
> CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> RAM: 16GB
> kernel: v5.8 with task_diag and /proc/all patches.
> 100K processes:
> $ ps ax | wc -l
> 10228

100k processes but showing 10k here??

>
> $ time cat /proc/all/status > /dev/null
>
> real 0m0.577s
> user 0m0.017s
> sys 0m0.559s
>
> task_proc_all is used to read /proc/pid/status for all tasks:
> https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c
>
> $ time ./task_proc_all status
> tasks: 100230
>
> real 0m0.924s
> user 0m0.054s
> sys 0m0.858s
>
>
> /proc/all/status is about 40% faster than /proc/*/status.
>
> Now let's take a look at the perf output:
>
> $ time perf record -g cat /proc/all/status > /dev/null
> $ perf report
> - 98.08% 1.38% cat [kernel.vmlinux] [k] entry_SYSCALL_64
> - 96.70% entry_SYSCALL_64
> - do_syscall_64
> - 94.97% ksys_read
> - 94.80% vfs_read
> - 94.58% proc_reg_read
> - seq_read
> - 87.95% proc_pid_status
> + 13.10% seq_put_decimal_ull_width
> - 11.69% task_mem
> + 9.48% seq_put_decimal_ull_width
> + 10.63% seq_printf
> - 10.35% cpuset_task_status_allowed
> + seq_printf
> - 9.84% render_sigset_t
> 1.61% seq_putc
> + 1.61% seq_puts
> + 4.99% proc_task_name
> + 4.11% seq_puts
> - 3.76% render_cap_t
> 2.38% seq_put_hex_ll
> + 1.25% seq_puts
> 2.64% __task_pid_nr_ns
> + 1.54% get_task_mm
> + 1.34% __lock_task_sighand
> + 0.70% from_kuid_munged
> 0.61% get_task_cred
> 0.56% seq_putc
> 0.52% hugetlb_report_usage
> 0.52% from_kgid_munged
> + 4.30% proc_all_next
> + 0.82% _copy_to_user
>
> We can see that the kernel spent more than 50% of the time encoding binary
> data into a text format.
>
> Now let's see how fast task_diag is:
>
> $ time ./task_diag_all all -c -q
>
> real 0m0.087s
> user 0m0.001s
> sys 0m0.082s
>
> Maybe we need to resurrect the task_diag series instead of inventing
> another, less effective interface...
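
For reference, the three quoted "real" times can be put side by side with a few lines of arithmetic. This is only a sketch: it compares the wall-clock numbers as quoted, and it assumes the three runs are comparable even though `task_diag_all all` may not collect exactly the same fields as the status file.

```python
# Wall-clock ("real") times in seconds, copied from the quoted runs.
timings = {
    "/proc/*/status (task_proc_all)": 0.924,
    "/proc/all/status (cat)": 0.577,
    "task_diag (task_diag_all)": 0.087,
}


def time_saved(baseline, candidate):
    """Fraction of the baseline's wall-clock time that the candidate saves."""
    return (baseline - candidate) / baseline


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


base = timings["/proc/*/status (task_proc_all)"]
for name, seconds in timings.items():
    print(f"{name}: {time_saved(base, seconds):.0%} less time, "
          f"{speedup(base, seconds):.1f}x")
```

On these numbers /proc/all/status takes roughly 38% less time than walking /proc/*/status (about 1.6x), while task_diag takes roughly 91% less (about 10.6x).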

I think the netlink message design is the better way to go. As system
sizes continue to increase (> 100 cpus is common now) you need to be
able to pass the right data to userspace as fast as possible to keep up
with what can be a very dynamic userspace and set of processes.
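
As a rough illustration of why the binary message format is cheap to consume, here is a sketch of walking netlink-style attributes (a 16-bit length, a 16-bit type, then a payload padded to 4 bytes, as in struct nlattr). The attribute type numbers and helper names below are hypothetical and are not task_diag's actual attribute set:

```python
import struct

NLA_HDRLEN = 4   # struct nlattr: u16 nla_len, u16 nla_type
NLA_ALIGNTO = 4  # attributes are padded to a 4-byte boundary

# Hypothetical attribute types, for illustration only.
ATTR_PID = 1
ATTR_VM_RSS = 2
ATTR_COMM = 3


def nla_align(n):
    """Round n up to the next multiple of NLA_ALIGNTO."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type, payload):
    """Build one attribute; nla_len covers header plus payload, not padding."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def parse_attrs(buf):
    """Return {attr_type: payload bytes} for a buffer of packed attributes."""
    attrs = {}
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            raise ValueError("truncated or malformed attribute")
        attrs[attr_type] = buf[offset + NLA_HDRLEN:offset + length]
        offset += nla_align(length)
    return attrs


# Build a fake message, then decode it without any text parsing.
msg = (pack_attr(ATTR_PID, struct.pack("=I", 1234))
       + pack_attr(ATTR_VM_RSS, struct.pack("=Q", 4096))
       + pack_attr(ATTR_COMM, b"sh\0"))
attrs = parse_attrs(msg)
pid = struct.unpack("=I", attrs[ATTR_PID])[0]
rss = struct.unpack("=Q", attrs[ATTR_VM_RSS])[0]
```

The receiver reads fixed-width fields directly and can skip attribute types it does not care about, which avoids both the seq_printf-style encoding on the kernel side and the text parsing in userspace.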

When you first proposed this idea I was working on systems with >= 1k
cpus and the netlink option was able to keep up with a 'make -j N' on
those systems. `perf record` walking /proc would never finish
initializing - I had to add a "done initializing" message to know when
to start a test. With the task_diag approach, perf could collect the
data in short order and move on to recording data.