Re: [PATCH] mm/oom: Add oom_score_adj value to oom Killed process message

From: Edward Chron
Date: Wed Aug 21 2019 - 18:22:21 EST


On Wed, Aug 21, 2019 at 12:19 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Wed, 21 Aug 2019, Michal Hocko wrote:
>
> > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > haven't left it enabled :/
> >
> > Because it generates a lot of output potentially. Think of a workload
> > with too many tasks which is not uncommon.
>
> Probably better to always print all the info for the victim so we don't
> need to duplicate everything between dump_tasks() and dump_oom_summary().
>
> Edward, how about this?
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
> * State information includes task's pid, uid, tgid, vm size, rss,
> * pgtables_bytes, swapents, oom_score_adj value, and name.
> */
> -static void dump_tasks(struct oom_control *oc)
> +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
> {
> pr_info("Tasks state (memory values in pages):\n");
> pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
>
> + /* If vm.oom_dump_tasks is disabled, only show the victim */
> + if (!sysctl_oom_dump_tasks) {
> + dump_task(victim, oc);
> + return;
> + }
> +
> if (is_memcg_oom(oc))
> mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> else {
> @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
> if (is_dump_unreclaim_slabs())
> dump_unreclaimable_slab();
> }
> - if (sysctl_oom_dump_tasks)
> - dump_tasks(oc);
> + if (p || sysctl_oom_dump_tasks)
> + dump_tasks(oc, p);
> if (p)
> dump_oom_summary(oc, p);
> }

I would be willing to accept this, though as Michal mentions in his
post, it would be very helpful to have the oom_score_adj on the Killed
process message.

One reason for that is that the Killed process message is the one
message that is printed with error priority (pr_err)
and so that message can be filtered out and sent to notify support
that an OOM event occurred.
Putting any information that can be shared in that message is useful
from my experience as it the initial point of triage for an OOM event.
Even if the full log with per user process is available it the
starting point for triage for an OOM event.

So from my perspective I would be happy having both, with David's
proposal providing a bit of extra information as shown here:

Jul 21 20:07:48 linuxserver kernel: [ pid ] uid tgid total_vm
rss pgtables_bytes swapents oom_score_adj name
Jul 21 20:07:48 linuxserver kernel: [ 547] 0 547 31664
615 299008 0 0
systemd-journal

The OOM Killed process message will print as:

Jul 21 20:07:48 linuxserver kernel: Out of memory: Killed process 2826
(oomprocs) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
shmem-rss:0kB oom_score_adj:1000

But if only one one output change is allowed I'd favor the Killed
process message since that can be singled due to it's print priority
and forwarded.

By the way, right now there is redundancy in that the Killed process
message is printing vm, rss even if vm.oom_dump_tasks is enabled.
I don't see why that is a big deal.
It is very useful to have all the information that is there.
Wouldn't mind also having pgtables too but we would be able to get
that from the output of dump_task if that is enabled.

If it is acceptable to also add the dump_task for the killed process
for !sysctl_oom_dump_tasks I can repost the patch including that as
well.

Thank-you,

Edward Chron
Arista Networks