Re: [patch 1/2] mm, memcg: avoid oom notification when current needsaccess to memory reserves

From: Michal Hocko
Date: Thu Dec 12 2013 - 05:32:13 EST


On Wed 11-12-13 14:40:24, David Rientjes wrote:
> On Wed, 11 Dec 2013, Michal Hocko wrote:
>
> > > Triggering a pointless notification with PF_EXITING is rare, yet one
> > > pointless notification can be avoided with the patch.
> >
> > Sigh. Yes it will avoid one particular and rare race. There will still
> > be notifications without oom kills.
> >
>
> Would you prefer doing the mem_cgroup_oom_notify() in two places instead:
>
> - immediately before doing oom_kill_process() when it's guaranteed that
> the kernel would have killed something, and
>
> - when memory.oom_control == 1 in mem_cgroup_oom_synchronize()?

Yes that would make sense to me. At least the two oom_control paths
would be consistent wrt. notifications. I thought it would be too messy
but it looks quite straightforward:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c72b03bf9679..5cb1deea6aac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2256,15 +2256,16 @@ bool mem_cgroup_oom_synchronize(bool handle)

locked = mem_cgroup_oom_trylock(memcg);

- if (locked)
- mem_cgroup_oom_notify(memcg);
-
if (locked && !memcg->oom_kill_disable) {
mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
+ /* calls mem_cgroup_oom_notify if there is a task to kill */
mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
current->memcg_oom.order);
} else {
+ if (locked && memcg->oom_kill_disable)
+ mem_cgroup_oom_notify(memcg);
+
schedule();
mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1e4a600a6163..2a7f15900922 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,6 +470,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
victim = p;
}

+ if (memcg)
+ mem_cgroup_oom_notify(memcg);
+
/* mm cannot safely be dereferenced after task_unlock(victim) */
mm = victim->mm;
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",


The semantic would be as simple as "notification is sent only when
an action is due". It will be still racy as nothing prevents a task
which is not under OOM to exit and release some memory but there is no
sensible way to address that. On the other hand such a semantic would be
sensible for oom_control listeners because they will know that an action
has to be or will be taken (the line was drawn).

Can we agree on this, Johannes? Or you see the line drawn when
mem_cgroup_oom_synchronize has been reached already no matter whether
the action is to be done or not?

Regardless the above. We would still have to cope with PF_EXITING
without TIF_MEMDIE entering OOM which is a separate issue. I think the
easier and cleaner solution would be to bail out early and do not even
charge for PF_EXITING tasks. It will solve the issue mentioned before
and also reduce the exit latency. Besides that I do not think we are
talking about many charges, do we?

> > Anyway.
> > Does the reclaim make any sense for PF_EXITING tasks? Shouldn't we
> > simply bypass charges of these tasks automatically. Those tasks will
> > free some memory anyway so why to trigger reclaim and potentially OOM
> > in the first place? Do we need to go via TIF_MEMDIE loop in the first
> > place?
> >
>
> I don't see any reason to make an optimization there since they will get
> TIF_MEMDIE set if reclaim has failed on one of their charges or if it
> results in a system oom through the page allocator's oom killer.

This all will happen after MEM_CGROUP_RECLAIM_RETRIES full reclaim
rounds. Is it really worth the addional overhead just to later say "OK
go ahead and skipp charges"?
And for the !oom memcg it might reclaim some pages which could have
stayed on LRUs just to free some memory little bit later and release the
memory pressure.
So I would rather go with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c72b03bf9679..fee25c5934d2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2692,7 +2693,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
* MEMDIE process.
*/
if (unlikely(test_thread_flag(TIF_MEMDIE)
- || fatal_signal_pending(current)))
+ || fatal_signal_pending(current))
+ || current->flags & PF_EXITING)
goto bypass;

if (unlikely(task_in_memcg_oom(current)))

rather than the later checks down the oom_synchronize paths. The comment
already mentions dying process...

> It would be nice to ensure reclaim has had a chance to free memory in
> the presence of any other potential parallel memory freeing.

I am afraid I didn't get what you mean by this. We can only check we are
under OOM or try to reclaim to see if there is something...
--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/