[PATCH] psi: trigger the OOM killer on severe thrashing

From: Johannes Weiner
Date: Mon Aug 05 2019 - 13:15:16 EST


Over the last few years we have had many reports that the kernel can
enter an extended livelock situation under sufficient memory
pressure. The system becomes unresponsive and fully IO bound for
indefinite periods of time, and often the user has no choice but to
reboot. Even though the system is clearly struggling with a shortage
of memory, the OOM killer is not engaging reliably.

The reason is that with bigger RAM, and in particular with faster
SSDs, page reclaim does not necessarily fail in the traditional sense
anymore. In the time it takes the CPU to run through the vast LRU
lists, there are almost always some cache pages that have finished
reading in and can be reclaimed, even before userspace had a chance to
access them. As a result, reclaim is nominally succeeding, but
userspace is refault-bound and not making significant progress.

While this is clearly noticeable to human beings, the kernel could not
actually determine this state with the traditional memory event
counters. We might see a certain rate of reclaim activity or refaults,
but how long, or whether at all, userspace is unproductive because of
it depends on IO speed, readahead efficiency, as well as memory access
patterns and concurrency of the userspace applications. The same
number of VM events could go unnoticed in one system / workload
combination and result in an indefinite lockup in another.

However, eb414681d5a0 ("psi: pressure stall information for CPU,
memory, and IO") introduced a memory pressure metric that quantifies
the share of wallclock time in which userspace waits on reclaim,
refaults, and swapins. By using absolute time, it encodes all the
above-mentioned variables of hardware capacity and workload behavior.
When memory pressure is 40%, it means that 40% of the time the
workload is stalled on memory, period. This is the actual measure for
the lack of forward progress that users can experience. They also
expect the kernel to step in and remedy the situation when forward
progress becomes non-existent.
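
For illustration, the same metric is visible from userspace in
/proc/pressure/memory, where the "full" line carries the averages.
The following is a minimal sketch, not part of this patch, of pulling
the 10-second "full" average out of text in that format. The function
name psi_parse_full_avg10() is hypothetical, and the sample values
used with it are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): scan text in the
 * format of /proc/pressure/memory and extract the avg10 value of
 * the "full" line.
 *
 * Returns 0 on success, -1 if no "full" line was found.
 */
int psi_parse_full_avg10(const char *text, double *avg10)
{
	const char *line = text;

	assert(text && avg10);

	while (line && *line) {
		/* Only a line starting with "full avg10=" matches. */
		if (sscanf(line, "full avg10=%lf", avg10) == 1)
			return 0;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	return -1;
}
```

A caller would read the file into a buffer and pass the buffer in; a
result of 80.0 or more corresponds to the default cutoff level
discussed below.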

To accomplish this, this patch implements a thrashing cutoff for the
OOM killer. If the kernel determines a sustained high level of memory
pressure, and thus a lack of forward progress in userspace, it will
trigger the OOM killer to reduce memory contention.

By default, the OOM killer will engage after 15 seconds of at least
80% memory pressure. These values are tunable via sysctls
vm.thrashing_oom_period and vm.thrashing_oom_level.
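
The "sustained for a period" rule can be modelled outside the kernel
as a small state machine. The sketch below is a userspace
approximation for illustration only; struct thrash_state and
thrash_tick() are hypothetical names, and the real logic lives in
psi_oom_tick() in the diff further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical userspace model of the thrashing tracking state. */
struct thrash_state {
	bool tracking;		/* pressure is at or above the level */
	uint64_t start_ns;	/* when pressure first reached the level */
};

/*
 * Feed one pressure sample (0-100) taken at time now_ns.
 * Returns true once pressure has stayed at or above 'level' for at
 * least 'period_s' seconds; tracking is then reset so that a new
 * full period must elapse before the next trigger.
 */
bool thrash_tick(struct thrash_state *st, unsigned int pressure,
		 uint64_t now_ns, unsigned int level, unsigned int period_s)
{
	assert(st);

	/* A level of 0 disables the check entirely. */
	if (!level)
		return false;

	/* Pressure dropped below the level: stop tracking. */
	if (pressure < level) {
		st->tracking = false;
		return false;
	}

	/* Pressure just reached the level: start the clock. */
	if (!st->tracking) {
		st->tracking = true;
		st->start_ns = now_ns;
		return false;
	}

	/* Still high, but not for long enough yet. */
	if (now_ns < st->start_ns + (uint64_t)period_s * NSEC_PER_SEC)
		return false;

	st->tracking = false;
	return true;
}
```

With the defaults (level 80, period 15), a sample stream that stays at
90 triggers on the first sample taken 15 seconds or more after the
level was first reached, and any single sample below 80 restarts the
wait.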

Ideally, this would be standard behavior for the kernel, but since it
involves a new metric and OOM killing, let's be safe and make it an
opt-in via CONFIG_THRASHING_OOM. Setting vm.thrashing_oom_level to 0
also disables the feature at runtime.

Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Reported-by: "Artem S. Tashkinov" <aros@xxxxxxx>
---
 Documentation/admin-guide/sysctl/vm.rst | 24 ++++++++
 include/linux/psi.h                     |  5 ++
 include/linux/psi_types.h               |  6 ++
 kernel/sched/psi.c                      | 74 +++++++++++++++++++++++++
 kernel/sysctl.c                         | 20 +++++++
 mm/Kconfig                              | 20 +++++++
 6 files changed, 149 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..0332cb52bcfc 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -66,6 +66,8 @@ files can be found in mm/swap.c.
- stat_interval
- stat_refresh
- numa_stat
+- thrashing_oom_level
+- thrashing_oom_period
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
@@ -825,6 +827,28 @@ When page allocation performance is not a bottleneck and you want all
echo 1 > /proc/sys/vm/numa_stat


+thrashing_oom_level
+===================
+
+This defines the memory pressure level for severe thrashing at which
+the OOM killer will be engaged.
+
+The default is 80. This means the system is considered to be thrashing
+severely when all active tasks are collectively stalled on memory
+(waiting for page reclaim, refaults, swapins etc) for 80% of the time.
+
+A setting of 0 will disable thrashing-based OOM killing.
+
+
+thrashing_oom_period
+====================
+
+This defines the number of seconds the system must sustain severe
+thrashing at thrashing_oom_level before the OOM killer is invoked.
+
+The default is 15.
+
+
swappiness
==========

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de7321219..661ce45900f9 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -37,6 +37,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
poll_table *wait);
#endif

+#ifdef CONFIG_THRASHING_OOM
+extern unsigned int sysctl_thrashing_oom_level;
+extern unsigned int sysctl_thrashing_oom_period;
+#endif
+
#else /* CONFIG_PSI */

static inline void psi_init(void) {}
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..7c57d7e5627e 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -162,6 +162,12 @@ struct psi_group {
u64 polling_total[NR_PSI_STATES - 1];
u64 polling_next_update;
u64 polling_until;
+
+#ifdef CONFIG_THRASHING_OOM
+	/* Severe thrashing state tracking */
+	bool oom_pressure;
+	u64 oom_pressure_start;
+#endif
};

#else /* CONFIG_PSI */
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index f28342dc65ec..4b1b620d6359 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -139,6 +139,7 @@
#include <linux/ctype.h>
#include <linux/file.h>
#include <linux/poll.h>
+#include <linux/oom.h>
#include <linux/psi.h>
#include "sched.h"

@@ -177,6 +178,14 @@ struct psi_group psi_system = {
.pcpu = &system_group_pcpu,
};

+#ifdef CONFIG_THRASHING_OOM
+static void psi_oom_tick(struct psi_group *group, u64 now);
+#else
+static inline void psi_oom_tick(struct psi_group *group, u64 now)
+{
+}
+#endif
+
static void psi_avgs_work(struct work_struct *work);

static void group_init(struct psi_group *group)
@@ -403,6 +412,8 @@ static u64 update_averages(struct psi_group *group, u64 now)
calc_avgs(group->avg[s], missed_periods, sample, period);
}

+	psi_oom_tick(group, now);
+
return avg_next_update;
}

@@ -1280,3 +1291,66 @@ static int __init psi_proc_init(void)
return 0;
}
module_init(psi_proc_init);
+
+#ifdef CONFIG_THRASHING_OOM
+/*
+ * Trigger the OOM killer when detecting severe thrashing.
+ *
+ * By default we define severe thrashing as 15 seconds of 80% memory
+ * pressure (i.e. all active tasks are collectively stalled on memory
+ * 80% of the time).
+ */
+unsigned int sysctl_thrashing_oom_level = 80;
+unsigned int sysctl_thrashing_oom_period = 15;
+
+static void psi_oom_tick(struct psi_group *group, u64 now)
+{
+	struct oom_control oc = {
+		.order = 0,
+	};
+	unsigned long pressure;
+	bool high;
+
+	/* Disabled at runtime */
+	if (!sysctl_thrashing_oom_level)
+		return;
+
+	/*
+	 * Protect the system from livelocking due to thrashing. Leave
+	 * per-cgroup policies to oomd, lmkd etc.
+	 */
+	if (group != &psi_system)
+		return;
+
+	pressure = LOAD_INT(group->avg[PSI_MEM_FULL][0]);
+	high = pressure >= sysctl_thrashing_oom_level;
+
+	if (!group->oom_pressure && !high)
+		return;
+
+	if (!group->oom_pressure && high) {
+		group->oom_pressure = true;
+		group->oom_pressure_start = now;
+		return;
+	}
+
+	if (group->oom_pressure && !high) {
+		group->oom_pressure = false;
+		return;
+	}
+
+	if (now < group->oom_pressure_start +
+	    (u64)sysctl_thrashing_oom_period * NSEC_PER_SEC)
+		return;
+
+	pr_warn("Severe thrashing detected! (%us of %u%% memory pressure)\n",
+		sysctl_thrashing_oom_period, sysctl_thrashing_oom_level);
+
+	group->oom_pressure = false;
+
+	if (!mutex_trylock(&oom_lock))
+		return;
+	out_of_memory(&oc);
+	mutex_unlock(&oom_lock);
+}
+#endif /* CONFIG_THRASHING_OOM */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f12888971d66..3b9b3deb1836 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -68,6 +68,7 @@
#include <linux/bpf.h>
#include <linux/mount.h>
#include <linux/userfaultfd_k.h>
+#include <linux/psi.h>

#include "../lib/kstrtox.h"

@@ -1746,6 +1747,25 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
},
+#endif
+#ifdef CONFIG_THRASHING_OOM
+	{
+		.procname	= "thrashing_oom_level",
+		.data		= &sysctl_thrashing_oom_level,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "thrashing_oom_period",
+		.data		= &sysctl_thrashing_oom_period,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+	},
#endif
{ }
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..cef13b423beb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,24 @@ config ARCH_HAS_PTE_SPECIAL
config ARCH_HAS_HUGEPD
bool

+config THRASHING_OOM
+	bool "Trigger the OOM killer on severe thrashing"
+	select PSI
+	help
+	  Under memory pressure, the kernel can enter severe thrashing
+	  or swap storms during which the system is fully IO-bound and
+	  does not respond to any user input. The OOM killer does not
+	  always engage because page reclaim manages to make nominal
+	  forward progress, but the system is effectively livelocked.
+
+	  This feature uses pressure stall information (PSI) to detect
+	  severe thrashing and trigger the OOM killer.
+
+	  By default, the OOM killer will be engaged when the system
+	  sustains a memory pressure level of 80% for 15 seconds. This
+	  can be adjusted using the vm.thrashing_oom_[level|period]
+	  sysctls.
+
+	  Say Y if you have observed your system becoming unresponsive
+	  for extended periods under memory pressure.
+
endmenu
--
2.22.0