Re: [PATCH 0/3] OOM detection rework v4

From: Andrew Morton
Date: Wed Dec 16 2015 - 18:35:21 EST

Next message: Alexandre Belloni: "[PATCH 2/2] rtc: abx80x: add alarm support"
Previous message: David Miller: "Re: [PATCH] 82xx: FCC: Fixing a bug causing to FCC port lock-up"
Next in thread: Michal Hocko: "Re: [PATCH 0/3] OOM detection rework v4"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@xxxxxxxxxx> wrote:

> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress which is implemented at the page allocator layer rather than
> spread out between the allocator and the reclaim. More on the
> implementation is described in the first patch.

We've been futzing with this stuff for many years and it still isn't
working well. This makes me expect that the new implementation will
take a long time to settle in.

To aid and accelerate this process I suggest we lard this code up with
lots of debug info, so when someone reports an issue we have the best
possible chance of understanding what went wrong.

This is easy in the case of oom-too-early - it's all slowpath code and
we can just do printk(everything). It's not so easy in the case of
oom-too-late-or-never. The reporter's machine just hangs or it
twiddles thumbs for five minutes then goes oom. But there are things
we can do here as well, such as:

- add an automatic "nearly oom" detection which detects when things
  start going wrong and turns on diagnostics (this would need an enable
  knob, possibly in debugfs).

- forget about an autodetector and simply add a debugfs knob to turn on
  the diagnostics.

- sprinkle tracepoints everywhere and provide a set of
  instructions/scripts so that people who know nothing about kernel
  internals or tracing can easily gather the info we need to understand
  issues.

- add a sysrq key to turn on diagnostics. Pretty essential when the
  machine is comatose and doesn't respond to keystrokes.

- something else
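
For the tracepoint option above, a script of roughly this shape might be
enough for a reporter to run. It is only a sketch: the tracefs mount
point and the event group names (vmscan, compaction, oom) are
assumptions to be checked against the running kernel, and
set_trace_events is a hypothetical helper name.

```python
import os

# Assumed mount point; older setups expose it at /sys/kernel/debug/tracing.
TRACING_ROOT = "/sys/kernel/tracing"
# Assumed event groups of interest; adjust to what the kernel provides.
EVENT_GROUPS = ("vmscan", "compaction", "oom")


def set_trace_events(tracing_root, groups, enable):
    """Write 1 or 0 to events/<group>/enable; return the groups toggled."""
    toggled = []
    for group in groups:
        path = os.path.join(tracing_root, "events", group, "enable")
        if not os.path.isfile(path):
            continue  # group not present on this kernel, skip it
        with open(path, "w") as f:
            f.write("1" if enable else "0")
        toggled.append(group)
    return toggled
```

Run as root, a reporter would call
set_trace_events(TRACING_ROOT, EVENT_GROUPS, True), reproduce the stall,
and then send us the contents of the "trace" file under the same
directory.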

So... please have a think about it? What can we add in here to make it
as easy as possible for us (i.e. you ;)) to get this code working well?
At this time, too much developer support code will be better than too
little. We can take it out later on.
