Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

From: ndrw
Date: Thu Aug 08 2019 - 17:59:36 EST


On 08/08/2019 19:59, Michal Hocko wrote:
Well, I am afraid that implementing anything like that in the kernel
will lead to many regressions and bug reports. People tend to have very
different opinions on when it is suitable to kill a potentially
important part of a workload just because memory gets low.

Are you proposing having a zero memory reserve or not having such option at all? I'm fine with the current default (zero reserve/margin).

I strongly prefer forcing OOM killer when the system is still running normally. Not just for preventing stalls: in my limited testing I found the OOM killer on a stalled system rather inaccurate, occasionally killing system services etc. I had much better experience with earlyoom.

LRU aspect doesn't help much, really. If we are reclaiming the same set
of pages becuase they are needed for the workload to operate then we are
effectivelly treshing no matter what kind of replacement policy you are
going to use.

In my case it would work fine (my system already works well with earlyoom, and without it it remains responsive until last couple hundred MB of RAM).


PSI is giving you a matric that tells you how much time you
spend on the memory reclaim. So you can start watching the system from
lower utilization already.

I've tested it on a system with 45GB of RAM, SSD, swap disabled (my intention was to approximate a worst-case scenario) and it didn't really detect stall before it happened. I can see some activity after reaching ~42GB, the system remains fully responsive until it suddenly freezes and requires sysrq-f. PSI appears to increase a bit when the system is about to run out of memory but the change is so small it would be difficult to set a reliable threshold. I expect the PSI numbers to increase significantly after the stall (I wasn't able to capture them) but, as mentioned above, I was hoping for a solution that would work before the stall.

$ while true; do sleep 1; cat /proc/pressure/memory ; done
[starting a test script and waiting for several minutes to fill up memory]
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some avg10=0.00 avg60=0.00 avg300=0.00 total=10389
full avg10=0.00 avg60=0.00 avg300=0.00 total=6442
some avg10=0.00 avg60=0.00 avg300=0.00 total=18950
full avg10=0.00 avg60=0.00 avg300=0.00 total=11576
some avg10=0.00 avg60=0.00 avg300=0.00 total=25655
full avg10=0.00 avg60=0.00 avg300=0.00 total=16159
some avg10=0.00 avg60=0.00 avg300=0.00 total=31438
full avg10=0.00 avg60=0.00 avg300=0.00 total=19552
some avg10=0.00 avg60=0.00 avg300=0.00 total=44549
full avg10=0.00 avg60=0.00 avg300=0.00 total=27772
some avg10=0.00 avg60=0.00 avg300=0.00 total=52520
full avg10=0.00 avg60=0.00 avg300=0.00 total=32580
some avg10=0.00 avg60=0.00 avg300=0.00 total=60451
full avg10=0.00 avg60=0.00 avg300=0.00 total=37704
some avg10=0.00 avg60=0.00 avg300=0.00 total=68986
full avg10=0.00 avg60=0.00 avg300=0.00 total=42859
some avg10=0.00 avg60=0.00 avg300=0.00 total=76598
full avg10=0.00 avg60=0.00 avg300=0.00 total=48370
some avg10=0.00 avg60=0.00 avg300=0.00 total=83080
full avg10=0.00 avg60=0.00 avg300=0.00 total=52930
some avg10=0.00 avg60=0.00 avg300=0.00 total=89384
full avg10=0.00 avg60=0.00 avg300=0.00 total=56350
some avg10=0.00 avg60=0.00 avg300=0.00 total=95293
full avg10=0.00 avg60=0.00 avg300=0.00 total=60260
some avg10=0.00 avg60=0.00 avg300=0.00 total=101566
full avg10=0.00 avg60=0.00 avg300=0.00 total=64408
some avg10=0.00 avg60=0.00 avg300=0.00 total=108131
full avg10=0.00 avg60=0.00 avg300=0.00 total=68412
some avg10=0.00 avg60=0.00 avg300=0.00 total=121932
full avg10=0.00 avg60=0.00 avg300=0.00 total=77413
some avg10=0.00 avg60=0.00 avg300=0.00 total=140807
full avg10=0.00 avg60=0.00 avg300=0.00 total=91269
some avg10=0.00 avg60=0.00 avg300=0.00 total=170494
full avg10=0.00 avg60=0.00 avg300=0.00 total=110611
[stall, sysrq-f]

Best regards,

ndrw