[PATCH v3] mm: oom: introduce cpuset oom

From: Gang Li
Date: Sun Apr 09 2023 - 22:51:36 EST


Cpusets constrain the CPU and Memory placement of tasks.
The `CONSTRAINT_CPUSET` type in the OOM killer has existed for a long
time, but has never been utilized.
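
(For reference, the classification itself is already in place; the snippet
below is roughly paraphrased from the cpuset-limited branch of
constrained_alloc() in mm/oom_kill.c and is not part of this patch.)

```
	/* Check this allocation failure is caused by cpuset's wall function */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
			highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;

	if (cpuset_limited) {
		oc->totalpages = total_swap_pages;
		for_each_node_mask(nid, cpuset_current_mems_allowed)
			oc->totalpages += node_present_pages(nid);
		return CONSTRAINT_CPUSET;
	}
```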

When a process in a cpuset that constrains memory placement triggers an
OOM, the OOM killer may kill a completely irrelevant process on other
NUMA nodes, which will not release any memory for this cpuset.

We can easily achieve node-aware OOM by using `CONSTRAINT_CPUSET` and
selecting the victim from all cpusets with the same mems_allowed as the
current cpuset.

Example:

Create two processes named mem_on_node0 and mem_on_node1, each
constrained by its own cpuset. These two processes allocate memory on
their own node. Once node0 has run out of memory, an OOM is invoked by
mem_on_node0.
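
The mem_on_node0 / mem_on_node1 programs are not included in this patch;
a minimal sketch of such a test program could look like the following
(the chunk size and other details are illustrative; placement on a single
node comes from the cpuset.mems of the cpuset the process is started in):

```
/*
 * Illustrative only -- not part of this patch. Keeps allocating and
 * touching memory; placement on a single node is enforced by the
 * cpuset.mems of the cpuset this process is started in.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t chunk = 64UL << 20;	/* 64 MiB per step, illustrative */

	for (;;) {
		void *p = malloc(chunk);

		if (!p) {
			perror("malloc");
			break;
		}
		memset(p, 0xaa, chunk);	/* fault the pages in */
	}
	pause();	/* keep the memory resident until killed */
	return 0;
}
```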

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim is selected from
the entire system. Therefore, the OOM killer is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This
is a useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [ 13432] 0 13432 787016 786745 6344704 0 0 mem_on_node1
[ 2787.189115] [ 13457] 0 13457 787002 785504 6340608 0 0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim is selected only from cpusets that have the same
mems_allowed as the cpuset that invoked the OOM. This prevents
useless kills and protects innocent victims.

```
[ 395.922444] mem_on_node0 invoked oom-killer
[ 396.239777] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 396.246128] [ 2614] 0 2614 1311294 1144192 9224192 0 0 mem_on_node0
[ 396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```

Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
Cc: <cgroups@xxxxxxxxxxxxxxx>
Cc: <linux-mm@xxxxxxxxx>
Cc: <rientjes@xxxxxxxxxx>
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Zefan Li <lizefan.x@xxxxxxxxxxxxx>
Signed-off-by: Gang Li <ligang.bdlg@xxxxxxxxxxxxx>
---
Changes in v3:
- Provide more details about the use case, testing, and implementation.
- Document the userspace-visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
a kernel-doc comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@xxxxxxxxxxxxx/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.
(David Rientjes <rientjes@xxxxxxxxxx>)

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@xxxxxxxxxxxxx/
- Introduce cpuset oom.
---
.../admin-guide/cgroup-v1/cpusets.rst | 14 +++++-
include/linux/cpuset.h | 6 +++
kernel/cgroup/cpuset.c | 44 +++++++++++++++++++
mm/oom_kill.c | 4 ++
4 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..d686cd47e91d 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@ Written by Simon.Derr@xxxxxxxx
1.6 What is memory spread ?
1.7 What is sched_load_balance ?
1.8 What is sched_relax_domain_level ?
- 1.9 How do I use cpusets ?
+ 1.9 What is cpuset oom ?
+ 1.10 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
@@ -607,8 +608,17 @@ If your situation is:
- The latency is required even it sacrifices cache hit rate etc.
then increasing 'sched_relax_domain_level' would benefit you.

+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then the OOM (Out-Of-Memory) killer will be invoked.
+
+Since victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim is selected
+from all the cpusets that have the same mems_allowed as the cpuset
+that invoked the OOM.

-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}

+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
#else /* !CONFIG_CPUSETS */

static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}

+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+ return 0;
+}
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..4c51225568aa 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,50 @@ void cpuset_print_current_mems_allowed(void)
rcu_read_unlock();
}

+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task is stored in the
+ * oom_control passed via @arg. This function can only be called from
+ * select_bad_process() when oc->constraint == CONSTRAINT_CPUSET.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+ int ret = 0;
+ struct css_task_iter it;
+ struct task_struct *task;
+ struct cpuset *cs;
+ struct cgroup_subsys_state *pos_css;
+
+ /*
+ * Situation gets complex with overlapping nodemasks in different cpusets.
+ * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+ *
+ * But for now, let's make it simple. Just iterate through all cpusets
+ * with the same mems_allowed as the current cpuset.
+ */
+ cpuset_read_lock();
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+ if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+ css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+ while (!ret && (task = css_task_iter_next(&it)))
+ ret = fn(task, arg);
+ css_task_iter_end(&it);
+ }
+ }
+ rcu_read_unlock();
+ cpuset_read_unlock();
+ return ret;
+}
+
/*
* Collection of memory_pressure is suppressed unless
* this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)

if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+ else if (oc->constraint == CONSTRAINT_CPUSET)
+ cpuset_scan_tasks(oom_evaluate_task, oc);
else {
struct task_struct *p;

@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)

if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+ else if (oc->constraint == CONSTRAINT_CPUSET)
+ cpuset_scan_tasks(dump_task, oc);
else {
struct task_struct *p;

--
2.20.1