Re: 4.8.8 kernel trigger OOM killer repeatedly when I have lots of RAM that should be free

From: Linus Torvalds
Date: Tue Nov 29 2016 - 12:07:20 EST


On Tue, Nov 29, 2016 at 8:34 AM, Marc MERLIN <marc@xxxxxxxxxxx> wrote:
> Now, to be fair, this is not a new problem, it's just varying degrees of
> bad and usually only happens when I do a lot of I/O with btrfs.

One situation where I've seen something like this happen is

(a) lots and lots of dirty data queued up
(b) horribly slow storage
(c) filesystem that ends up serializing on writeback under certain
circumstances

The usual case for (b) in the modern world is big SSDs that have bad
worst-case behavior (ie they may do GB/s speeds when doing well, and
then come to a screeching halt when their buffers fill up and they
have to do rewrites, and that GB/s throughput drops to MB/s or
lower).

Generally you only find that kind of really nasty SSD in the USB stick
world these days.

The usual case for (c) is "fsync" or similar - often on a totally
unrelated file - which then ends up waiting for everything else to
flush too. Looks like btrfs_start_ordered_extent() does something kind
of like that, where it waits for data to be flushed.
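A crude way to see this effect is to dirty a lot of data and then time a
sync of a small unrelated file. This is only an illustrative sketch, not a
benchmark - the directory, sizes, and timing are arbitrary, and on fast
local storage the stall won't really show:

```shell
# Illustrative sketch: dirty a lot of data, then time syncing a tiny
# unrelated file. On slow storage with ordered writeback, the tiny
# sync can stall behind the big flush. Sizes and paths are arbitrary;
# point the temp dir at the filesystem you actually want to test.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/big" bs=1M count=64 2>/dev/null
: > "$dir/tiny"                      # tiny, unrelated file
start=$(date +%s%N)
sync "$dir/tiny"                     # coreutils >= 8.24 can sync one file
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))
echo "sync of tiny file took ${elapsed_ms} ms"
rm -rf "$dir"
```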

The usual *fix* for this is to just not get into situation (a).

Sadly, our defaults for "how much dirty data do we allow" are somewhat
buggered. The global defaults are in "percent of memory", and are
generally _much_ too high for big-memory machines:

[torvalds@i7 linux]$ cat /proc/sys/vm/dirty_ratio
20
[torvalds@i7 linux]$ cat /proc/sys/vm/dirty_background_ratio
10

says that it only starts really throttling writes when you hit 20% of
all memory in dirty data. You don't say how much memory you have in
that machine, but if it's the same one you talked about earlier, it
was 24GB. So you can have almost 5GB of dirty data waiting to be
flushed out.
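To put rough numbers on that for any given machine (a sketch only - the
kernel actually computes the limits against "dirtyable" memory, not
MemTotal, so this slightly overestimates):

```shell
# Rough sketch: what the percentage knobs translate to on this box.
# The kernel really uses "dirtyable" memory, so this overestimates.
ratio=$(cat /proc/sys/vm/dirty_ratio)
bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "throttle limit:   $(( mem_kb * ratio / 100 / 1024 )) MB"
echo "background limit: $(( mem_kb * bg_ratio / 100 / 1024 )) MB"
# How much is actually dirty / under writeback right now:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```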

And we *try* to do this per-device backing-dev congestion thing to
make things work better, but it generally seems to not work very well.
Possibly because of inconsistent write speeds (ie _sometimes_ the SSD
does really well, and we want to open up, and then it shuts down).

One thing you can try is to just make the global limits much lower. As in

echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio

(if you want to go lower than 1%, you'll have to use the
"dirty_bytes" and "dirty_background_bytes" byte limits instead of the
percentage limits).

Obviously you'll need to be root for this, and equally obviously it's
really a failure of the kernel. I'd *love* to get something like this
right automatically, but sadly it depends so much on memory size,
load, disk subsystem, etc etc that I despair at it.

On x86-32 we "fixed" this long ago by just saying "high memory is not
dirtyable", so you were always limited to a maximum of 10/20% of the
~1GB of lowmem, rather than of the full memory range. It worked
better, but it's a sad kind of fix.

(See commit dc6e29da9162: "Fix balance_dirty_page() calculations with
CONFIG_HIGHMEM")

Linus