Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks

From: David Rientjes
Date: Tue Jan 19 2016 - 18:01:58 EST


On Fri, 15 Jan 2016, One Thousand Gnomes wrote:

> > > I think it's time to kill sysrq+F and I'll send those two patches
> > > unless there is a usecase I'm not aware of.
> >
> > I have described one in the part you haven't quoted here. Let me repeat:
> > : Your system might be thrashing to the point you are not able to log in
> > : and resolve the situation in a reasonable time yet you are still not
> > : OOM. sysrq+f is your only choice then.
> >
> > Could you clarify why it is better to ditch a potentially useful
> > emergency tool rather than to make it work reliably and predictably?
>
> Even if it doesn't work reliably and predictably it is *still* better
> than removing it as it works currently. Today we have "might save you a
> reboot", the removal turns it into "you'll have to reboot". That's a
> regression.
>

Under what circumstances do you expect to use sysrq+f in your
hypothetical? If you have access to the shell, then you can kill any
process at random (and you may even be able to make better realtime
decisions than the oom killer) and it will gain access to memory reserves
immediately under my proposal when it tries to allocate memory. The net
result is that calling the oom killer is no better than you issuing the
SIGKILL yourself.

This doesn't work if you expect to use sysrq+f without the ability to
get access to the shell. That's the point, I believe, that Michal has
raised in this thread. I'd like to address that issue directly rather
than requiring human intervention to fix it. If you have deployed a very
large number of machines to your datacenters, you can't possibly have the
resources to do this.