Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

From: Johannes Weiner
Date: Tue Aug 06 2019 - 10:27:33 EST


On Tue, Aug 06, 2019 at 11:36:48AM +0200, Vlastimil Babka wrote:
> On 8/6/19 3:08 AM, Suren Baghdasaryan wrote:
> >> @@ -1280,3 +1285,50 @@ static int __init psi_proc_init(void)
> >> return 0;
> >> }
> >> module_init(psi_proc_init);
> >> +
> >> +#define OOM_PRESSURE_LEVEL 80
> >> +#define OOM_PRESSURE_PERIOD (10 * NSEC_PER_SEC)
> >
> > 80% of the last 10 seconds spent in full stall would definitely be a
> > problem. If the system was already low on memory (which it probably
> > is, or we would not be reclaiming so hard and registering such a big
> > stall) then oom-killer would probably kill something before 8 seconds
> > are passed.
>
> If oom killer can act faster, then great! On small embedded systems you probably
> don't enable PSI anyway?
>
> > If my line of thinking is correct, then do we really
> > benefit from such additional protection mechanism? I might be wrong
> > here because my experience is limited to embedded systems with
> > relatively small amounts of memory.
>
> Well, Artem in his original mail describes a minutes long stall. Things are
> really different on a fast desktop/laptop with SSD. I have experienced this as
> well, ending up performing manual OOM by alt-sysrq-f (then I put more RAM than
> 8GB in the laptop). IMHO the default limit should be set so that the user
> doesn't do that manual OOM (or hard reboot) before the mechanism kicks in. 10
> seconds should be fine.

That's exactly what I have experienced in the past, and this was also
the consistent story in the bug reports we have had.

I suspect it requires a certain combination of RAM size, CPU speed,
and IO capacity: the OOM killer kicks in when reclaim fails, which
happens when all scanned LRU pages were locked and under IO. So IO
needs to be slow enough, or RAM small enough, that the CPU can scan
all LRU pages while they are temporarily unreclaimable (page lock).

It may well be that on phones the RAM is small enough relative to CPU
speed.

But on desktops/servers, we frequently see that there is a wider
window of memory consumption in which reclaim efficiency doesn't drop
low enough for the OOM killer to kick in. In the time it takes the CPU
to scan through RAM, enough pages will have *just* finished reading
for reclaim to free them again and continue to make "progress".
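
To make that failure mode concrete, here is a toy userspace model, not
kernel code. It assumes the OOM decision depends only on a run of
consecutive scan passes that reclaim nothing. The function name and the
retry limit are made up for illustration. The real reclaim and retry
logic is considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical retry limit for this toy model only. */
#define TOY_MAX_NO_PROGRESS 16

/*
 * reclaimed[i] is the number of pages freed by the i-th scan pass.
 * Returns true if the toy model would declare OOM, i.e. if it sees
 * TOY_MAX_NO_PROGRESS consecutive passes that freed nothing.
 */
static bool toy_would_oom(const unsigned long *reclaimed, size_t passes)
{
	unsigned int no_progress = 0;

	for (size_t i = 0; i < passes; i++) {
		if (reclaimed[i] > 0)
			no_progress = 0;	/* any progress resets the count */
		else if (++no_progress >= TOY_MAX_NO_PROGRESS)
			return true;
	}
	return false;
}
```

In this model a single freed page every few passes is enough to keep
resetting the counter, so the OOM condition is never reached even though
almost no useful work gets done.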

We do know that the OOM killer might not kick in for at least 20-25
minutes while the system is entirely unresponsive. People usually
don't wait this long before forcibly rebooting. In a managed fleet,
ssh heartbeat tests eventually fail and force a reboot.

I'm not sure 10s is the perfect value here, but I do think the kernel
should try to get out of such a state, where interacting with the
system is impossible, within a reasonable amount of time.

It could be a little too short for non-interactive number-crunching
systems...
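
For anyone who wants to experiment with thresholds from userspace, a
rough sketch of the same kind of check follows. It assumes the text
format of /proc/pressure/memory ("some ..." and "full ..." lines with
avg10=, avg60=, avg300= and total= fields) and uses the "full avg10"
value as a stand-in for the 10 second window. That is only an
approximation: avg10 is a running average, while the quoted patch does
its accounting inside the kernel. The helper names and
STALL_THRESHOLD_PCT are made up here; the 80 simply mirrors
OOM_PRESSURE_LEVEL from the patch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative threshold, mirroring OOM_PRESSURE_LEVEL in the quoted patch. */
#define STALL_THRESHOLD_PCT 80.0

/*
 * Find the "full" line in the contents of /proc/pressure/memory and
 * store its avg10 percentage in *out. Returns true if it was found.
 */
static bool psi_full_avg10(const char *text, double *out)
{
	const char *line = text;

	while (line && *line) {
		if (sscanf(line, "full avg10=%lf", out) == 1)
			return true;
		line = strchr(line, '\n');	/* advance to the next line */
		if (line)
			line++;
	}
	return false;
}

/* True if the full-stall average is at or above the threshold. */
static bool stall_exceeds_threshold(const char *text)
{
	double avg10;

	return psi_full_avg10(text, &avg10) && avg10 >= STALL_THRESHOLD_PCT;
}
```

A caller would read /proc/pressure/memory into a buffer every second or
so and pass it to stall_exceeds_threshold(). What to do when it returns
true is a policy question; a non-interactive number-crunching box would
presumably want a higher threshold or a longer window.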