Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure

From: Johannes Weiner
Date: Mon Aug 05 2019 - 15:31:54 EST


On Mon, Aug 05, 2019 at 02:13:16PM +0200, Vlastimil Babka wrote:
> On 8/4/19 11:23 AM, Artem S. Tashkinov wrote:
> > Hello,
> >
> > There's this bug which has been bugging many people for many years
> > already and which is reproducible in less than a few minutes under the
> > latest and greatest kernel, 5.2.6. All the kernel parameters are set to
> > defaults.
> >
> > Steps to reproduce:
> >
> > 1) Boot with mem=4G
> > 2) Disable swap to make everything faster (sudo swapoff -a)
> > 3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
> > 4) Start opening tabs in either of them and watch your free RAM decrease
> >
> > Once you hit a situation when opening a new tab requires more RAM than
> > is currently available, the system will stall hard. You will barely be
> > able to move the mouse pointer. Your disk LED will be flashing
> > incessantly (I'm not entirely sure why). You will not be able to run new
> > applications or close currently running ones.
>
> > This little crisis may continue for minutes or even longer. I think
> > that's not how the system should behave in this situation. I believe
> > something must be done about that to avoid this stall.
>
> Yeah that's a known problem, in fact made worse by SSDs, as they are
> able to keep refaulting the last remaining file pages fast enough
> that there is still apparent progress in reclaim and OOM doesn't
> kick in.
>
> At this point, the likely solution will be probably based on pressure
> stall monitoring (PSI). I don't know how far we are from a built-in
> monitor with reasonable defaults for a desktop workload, so CCing
> relevant folks.

Yes, psi was specifically developed to address this problem. Before
it, the kernel had to make all decisions based on relative event rates
but had no notion of time. Whereas to the user, time is clearly an
issue, and in fact makes all the difference. So psi quantifies the
time the workload spends executing vs. spinning its wheels.
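[Aside, not part of the original mail: the core measurement can be
illustrated with a toy calculation. This is a deliberate
simplification; real psi aggregates per-CPU non-idle time and tracks
"some" vs "full" stall states.]

```python
def stall_percentage(stall_us_delta, window_us):
    """psi's core idea: the fraction of wall-clock time that tasks
    spent stalled on memory during the window, as a percentage.
    (Toy version; the kernel's accounting is per-CPU and weighted.)"""
    return 100.0 * stall_us_delta / window_us

# 1.5s of memory stalls over a 10s window: the workload spent 15% of
# its time spinning its wheels instead of executing.
print(stall_percentage(1_500_000, 10_000_000))  # 15.0
```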

But choosing a universal cutoff for killing is not possible, since it
depends on the workload and the user's expectation: GUI and other
latency-sensitive applications care way before a compile job or video
encoding would care.

Because of that, there are things like oomd and lmkd, as mentioned,
which leave the exact policy decision to userspace.
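[Aside, not part of the original mail: a minimal userspace monitor in
the spirit of oomd/lmkd might look like the sketch below. Python for
brevity; the 10% cutoff and 2s poll interval are illustrative
assumptions, and real daemons use the kernel's PSI trigger/poll
interface rather than sleeping in a loop.]

```python
import time

def parse_psi(text):
    """Parse /proc/pressure/memory contents, e.g.
    'some avg10=1.23 avg60=0.50 avg300=0.10 total=123456'."""
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

def should_intervene(psi, avg10_limit):
    """Pure policy decision: act once the short-term stall percentage
    exceeds the cutoff. The right cutoff is workload-specific; any
    number here is an assumption, not a recommendation."""
    return psi["some"]["avg10"] > avg10_limit

def monitor(avg10_limit, interval=2.0):
    """Greatly simplified poll loop (do not run as-is; needs a Linux
    kernel built with CONFIG_PSI)."""
    while True:
        with open("/proc/pressure/memory") as f:
            psi = parse_psi(f.read())
        if should_intervene(psi, avg10_limit):
            print("sustained pressure: pick and kill a victim here")
        time.sleep(interval)

# Demonstration on canned data: a desktop might act at 1%, a batch
# server maybe not before 40%.
sample = parse_psi("some avg10=12.5 avg60=3.1 avg300=0.8 total=900000")
print(should_intervene(sample, avg10_limit=10.0))  # True
```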

That being said, I think we should be able to provide a bare minimum
inside the kernel to avoid complete livelocks where the user does not
believe the machine would be able to recover without a reboot.

The goal wouldn't be a glitch-free user experience - the kernel does
not know enough about the applications to even attempt that. It should
just not hang indefinitely. Maybe similar to the hung task detector.

How about something like the below patch? With that, the kernel
catches excessive thrashing that happens before reclaim fails:

[root@ham ~]# stress -d 128 -m 5
stress: info: [344] dispatching hogs: 0 cpu, 0 io, 5 vm, 128 hdd
Excessive and sustained system-wide memory pressure!
kworker/1:2 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
CPU: 1 PID: 77 Comm: kworker/1:2 Not tainted 5.3.0-rc1-mm1-00121-ge34a5cf28771 #142
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
Workqueue: events psi_avgs_work
Call Trace:
dump_stack+0x46/0x60
dump_header+0x5c/0x3d5
? irq_work_queue+0x46/0x50
? wake_up_klogd+0x2b/0x30
? vprintk_emit+0xe5/0x190
oom_kill_process.cold.10+0xb/0x10
out_of_memory+0x1ea/0x260
update_averages.cold.8+0x14/0x25
? collect_percpu_times+0x84/0x1f0
psi_avgs_work+0x80/0xc0
process_one_work+0x1bb/0x310
worker_thread+0x28/0x3c0
? process_one_work+0x310/0x310
kthread+0x108/0x120
? __kthread_create_on_node+0x170/0x170
ret_from_fork+0x35/0x40
Mem-Info:
active_anon:109463 inactive_anon:109564 isolated_anon:298
active_file:4676 inactive_file:4073 isolated_file:455
unevictable:0 dirty:8475 writeback:8 unstable:0
slab_reclaimable:2585 slab_unreclaimable:4932
mapped:413 shmem:2 pagetables:1747 bounce:0
free:13472 free_pcp:17 free_cma:0

Possible snags and questions:

1. psi is an optional feature right now, but these livelocks commonly
affect desktop users. What should be the default behavior?

2. Should we make the pressure cutoff and time period configurable?

I fear we would open a can of worms similar to the existing OOM
killer, where users are trying to use a kernel self-protection
mechanism to implement workload QoS and priorities - things that
should firmly be kept in userspace.

3. swapoff annotation. Due to the swapin annotation, swapoff currently
raises memory pressure. It probably shouldn't. But this will be a
bigger problem if we trigger the oom killer based on it.

4. Killing once every 10s assumes basically one big culprit. If the
pressure is created by many different processes, fixing the
situation could take quite a while.

What oomd does to solve this is monitor the PGSCAN counters after
a kill, to tell whether pressure is persisting or is merely caused
by residual refaults after the culprit has been dealt with.

We may need to do something similar here. Or find a solution to
encode that distinction into psi itself, and it would also take
care of the swapoff problem, since it's basically the same thing -
residual refaults without any reclaim pressure to sustain them.
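[Aside, not part of the original mail: that post-kill check could be
sketched as follows, on canned /proc/vmstat-style data. The pgscan_*
counter names are real vmstat counters; the surrounding logic is
purely illustrative.]

```python
def pgscan_total(vmstat_text):
    """Sum all pgscan_* counters from /proc/vmstat-style text. If the
    sum keeps growing after a kill, reclaim is still scanning and the
    pressure is real; if it is flat, remaining stalls are residual
    refaults with no reclaim pressure to sustain them."""
    total = 0
    for line in vmstat_text.splitlines():
        key, value = line.split()
        if key.startswith("pgscan"):
            total += int(value)
    return total

# In a real monitor you would read /proc/vmstat before and after the
# kill; canned snapshots stand in for that here.
before = pgscan_total("pgscan_kswapd 100\npgscan_direct 40\nnr_dirty 8")
after  = pgscan_total("pgscan_kswapd 180\npgscan_direct 40\nnr_dirty 2")
print(after - before)  # 80: reclaim is still scanning, pressure persists
```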

Anyway, here is the draft patch: