Re: [PATCH] Prevent OOM casualties by enforcing memcg limits

From: Chris Down
Date: Tue Apr 27 2021 - 08:27:00 EST


Alexander Sosna writes:
We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
can still happen for a bunch of reasons, so I really hope PostgreSQL
isn't relying on that.

Could you please be more clear about the "huge problem" being solved
here? I'm not seeing it.

let me explain the problem I encounter and why I fell down the mm rabbit
hole. It is not a PostgreSQL specific problem but that's where I run
into it. PostgreSQL forks a backend for each client connection. All
backends have shared memory as well as local work memory. When a
backend needs more dynamic work_mem to execute a query, new memory
is allocated. It is normal that such an allocation can fail. If the
backend gets an ENOMEM the current query is rolled back an all dynamic
work_mem is freed. The RDBMS stays operational an no other query is
disturbed.

When running in a memory cgroup - for example via systemd or on k8s -
the kernel will not return ENOMEM even if the cgroup's memory limit is
exceeded. Instead the OOM killer is awakened and kills processes in the
violating cgroup. If any backend is killed with SIGKILL the shared
memory of the whole cluster is deemed potentially corrupted and
PostgreSQL needs to do an emergency restart. This cancels all operation
on all backends and it entails a potentially lengthy recovery process.
Therefore the behavior is quite "costly".

My point that memory cgroups are completely overcommit agnostic isn't just a question of abstract semantics, but a practical one. Exceeding memory.max is not overcommitment, because overages are physical, not virtual, and that has vastly different ramifications in terms of what managing that overage means.

For example, if we aggressively ENOMEM at the memory.max bounds, there's no provision provided for the natural bounds of memory reclaim to occur. Now maybe your application likes that (which I find highly dubious), but from a memory balancing perspective it's just nonsensical: we need to ensure that we're assisting forward progress of the system at the cgroup level, especially with the huge amounts of slack generated.

I totally understand that vm.overcommit_memory 2 does not mean "no OOM
killer". IMHO it should mean "no OOM killer if we can avoid it" and I
would highly appreciate if the kernel would use a less invasive means
whenever possible. I guess this might also be the expectation by many
other users. In my described case - which is a real pain for me - it is
quite easy to tweak the kernel behavior in order to handle this and
other similar situations with less casualties. This is why I send a
patch instead of starting a theoretical discussion.

vm.overcommit_memory=2 means "don't overcommit", nothing less, nothing more. Adding more semantics is a very good way to make an extremely confusing and overloaded API.

This commit reminds me of the comments on cosmetic products that say "no parabens". Ok, so there's no parabens -- great, parabens are terrible -- but are you now using a much more dangerous preservative instead?

Likewise, this commit claims that it reduces the likelihood of invoking the OOM killer -- great, nobody wants their processes to be OOM killed. What do we have instead? Code that calls off memory allocations way, way before it's needed to do so, and prevents the system from even getting into a state where it can efficiently evaluate how it should rebalance memory. That's really not a good tradeoff.

What do you think is necessary to get this to an approvable quality?

The problem is not the code, it's the concept and the way it interacts with the rest of the mm subsystem. It asks the mm subsystem to deny memory allocations long before it has even had a chance to reliably rebalance (just as one example, to punt anon pages to swap) based on the new allocations, which doesn't make very much sense. It may not break in some highly trivial setups, but it certainly will not work well with stacking or machines with high volatility of the anon/file LRUs. You're also likely to see random ENOMEM failures from kernelspace when operating under this memcg context long before such a response was necessary, which doesn't make much sense.

If you want to know when to back off allocations, use memory.high with PSI pressure metrics.

I also would strongly suggest that vm.overcommit_memory=2 is the equivalent of using a bucket of ignited thermite to warm one's house.