Re: [PATCH v3] mm/oom_kill: show oom eligibility when displaying the current memory state of all tasks

From: Michal Hocko
Date: Mon Aug 02 2021 - 02:34:38 EST


On Fri 30-07-21 17:20:02, Aaron Tomlin wrote:
> Changes since v2:
> - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
> by Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> - Add new header to oom_dump_tasks documentation
> - Provide further justification
>
>
> The output generated by dump_tasks() can be helpful to determine why
> there was an OOM condition and which rogue task potentially caused it.
> Please note that this is only provided when sysctl oom_dump_tasks is
> enabled.
>
> At the present time, when showing potential OOM victims, we do not
> exclude any task that are not OOM eligible e.g.

Well, this is not precise. We do exclude ineligible. Consider tasks that
are outside of the OOM domain for example. You are right that we are not
excluding all of them though.

> those that have
> MMF_OOM_SKIP set; it is possible that the last OOM killable victim was
> already OOM killed, yet the OOM reaper failed to reclaim memory and set
> MMF_OOM_SKIP. This can be confusing (or perhaps even be misleading) to the
> viewer. Now, we already unconditionally display a task's oom_score_adj_min
> value that can be set to OOM_SCORE_ADJ_MIN which is indicative of an
> "unkillable" task.
>
> This patch provides a clear indication with regard to the OOM ineligibility
> (and why) of each displayed task with the addition of a new column namely
> "oom_skipped". An example is provided below:
>
> [ 5084.524970] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj oom_skipped name
> [ 5084.526397] [660417] 0 660417 35869 683 167936 0 -1000 M conmon
> [ 5084.526400] [660452] 0 660452 175834 472 86016 0 -998 pod
> [ 5084.527460] [752415] 0 752415 35869 650 172032 0 -1000 M conmon
> [ 5084.527462] [752575] 1001050000 752575 184205 11158 700416 0 999 npm
> [ 5084.527467] [753606] 1001050000 753606 183380 46843 2134016 0 999 node
> [ 5084.527581] Memory cgroup out of memory: Killed process 753606 (node) total-vm:733520kB, anon-rss:161228kB, file-rss:26144kB, shmem-rss:0kB, UID:1001050000
>
> So, a single character 'M' is for OOM_SCORE_ADJ_MIN, 'R' MMF_OOM_SKIP and
> 'V' for in_vfork().

I have to say I dislike this for two reasons. First and foremost it
makes parsing unnecessarily more complex. Now you have a potentially
an empty column to special case. Secondly MMF_OOM_SKIP is an internal
state that shouldn't really leak to userspace IMO. in_vfork is a racy
check and M is already expressed via oom_score_adj so it duplicates an
existing information.

If you really want/need to make any change here then I would propose to
either add E(eligible)/I(ligible) column without any specifics or
consistently skip over all tasks which are not eligible.
> 2.31.1

--
Michal Hocko
SUSE Labs