Re: [discussion] Swap overcommitment recovery

Rik van Riel (H.H.vanRiel@phys.uu.nl)
Mon, 17 Aug 1998 20:28:18 +0200 (CEST)


On Sun, 16 Aug 1998, david wrote:

> Here are some of the ideas that have been tossed into the fray:
>
> - userland daemon
> - kernel thread

It doesn't have to be a special thread; it can run inside
kswapd's context. See the [PATCH] I posted on Sunday. If you
are interested in a solution, please test the patch.

As soon as I get reactions, I'll adjust the code, make it
sysctl tunable and generally beautify it. If nobody is
interested, I'll leave the code the quick and dirty way
it is now :)

> - let the user deal with it
>
> Well...let's go over these. A userland daemon sits under the same
> restrictions as all the other processes.
> Think Mr. Policeman with a foam baton giving out parking tickets in his
> little orange blinking light scooter.
>
> Now let's discuss implementing a means to accomplish this in the kernel.
> Let's toss out some ideas that have/have not been discussed.
>
> - kill the biggest ram sucker

Not always a good idea. The biggest ram sucker might have
been running for 2 weeks without doing any new allocations.
This means wasting 2 weeks of CPU time _and_ killing a
'good guy'.
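
To put rough numbers on that, here is a userspace sketch of the
scoring idea from the patch at the end of this mail: size divided by
sqrt(CPU used) and by the fourth root of the running time. The
struct fake_task and the names approx_sqrt() and score() are made up
for illustration; the sketch takes plain seconds where the patch
shifts jiffies, and it leaves out the bonuses and penalties.

```c
/* Same shift-based square-root approximation as int_sqrt() in the patch. */
static unsigned int approx_sqrt(unsigned int x)
{
    unsigned int out = x;

    while (x & ~1u) {
        x >>= 2;
        out >>= 1;
    }
    if (x)
        out -= out >> 2;
    return out ? out : 1;   /* never return 0, callers divide by this */
}

/* Hypothetical stand-in for the task fields the heuristic looks at. */
struct fake_task {
    unsigned int total_vm;     /* address space size, in pages */
    unsigned int cpu_seconds;  /* user + system CPU time used */
    unsigned int run_seconds;  /* wall-clock time since start */
};

/* Higher score = better candidate for killing. */
static unsigned int score(const struct fake_task *t)
{
    unsigned int points = t->total_vm;

    points /= approx_sqrt(t->cpu_seconds);
    points /= approx_sqrt(approx_sqrt(t->run_seconds));
    return points;
}
```

With these assumptions, a 50000-page job that has used 10000 CPU
seconds over about 11 days scores far lower than a 30000-page
process that started 16 seconds ago, so the newcomer is picked even
though it is the smaller of the two.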

> - kill the program requesting a page right now

Bad idea: this program might control some hardware (the X server,
for example), and killing it could crash the system anyway.

> - start killing user programs until the condition clears leading to:
> - start killing root programs ... leading to:

This is a little bit crude, don't you think? Look at my patch for
a more subtle idea (which several people have been thinking about
for several months).

> - init is the only thing left, something must be really wrong, now is the
> time to attempt disable writes, sync, mount RO, call our internal reboot
> methods.

This part isn't implemented yet, but it should indeed be included.
Syncing and umounting the disks should almost guarantee a fast
system startup with minimal downtime.

> - intelligently kill the process(es):

See my patch.

> - put all processes to sleep and notify a userland identified process of
> the situation and let that process decide what to do based on what the
> user has in a configuration file. note, this process -must- have all
> its pages previously allocated or it too is simply part of the problem.

This userland process takes up more space than a kernel-based
solution. Furthermore, it still can't do without kernel support.

I don't really see the advantage over kernel-based code, except
for flexibility, which is something most people won't really
care about when they have to lose something anyway.
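
For what it's worth, the "all pages previously allocated" requirement
on such a userland process would look roughly like the sketch below.
mlockall() is the real system call; pin_watchdog_memory() is a made-up
name, and a real daemon would also have to touch its stack and buffers
before the trouble starts.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Hypothetical helper for a userland OOM watchdog: lock every current
 * and future page into RAM so the watchdog never has to wait for a
 * page while the system is out of memory.
 *
 * Returns 0 on success, or the errno value set by mlockall() on
 * failure (for example when the caller lacks the privilege or the
 * locked-memory limit is too low).
 */
static int pin_watchdog_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;
    return 0;
}
```

Note that this only pins the watchdog's own memory. It still needs the
kernel to tell it that memory has run out, and to stop the other
processes while it decides.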

> Such a program can also be made a module which would make the kernel
> interface with it much easier.

This could be nice indeed. But still, I don't think we need to
code up solutions that large for problems that rare. Most folks
won't configure their systems to handle OOM situations; instead,
they'll configure them to have enough swap.

> Now granted that an admin should have set up resource limits, but....let's
> assume a new root exploit has come out as they tend to every other day on
> bugtraq. So. A new machine killer is born and more admins rip out their
> hair.

This is why a good-enough solution in kernelspace is preferable.
Killing something is bound to make someone unhappy, so not
making it configurable will place the blame on me instead of
on some innocent sysadmin... Besides, it will force people to
carefully make sure they don't run out of swap often, which is
a far better solution.

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide. H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader. http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

Once again, for the folks who erased their Sunday mail:
--- mm/Makefile.orig Sun Aug 16 17:26:38 1998
+++ mm/Makefile Sun Aug 16 17:26:57 1998
@@ -9,7 +9,7 @@

O_TARGET := mm.o
O_OBJS := memory.o mmap.o filemap.o mprotect.o mlock.o mremap.o \
- vmalloc.o slab.o \
+ vmalloc.o slab.o oom_kill.o\
swap.o vmscan.o page_io.o page_alloc.o swap_state.o swapfile.o

include $(TOPDIR)/Rules.make
--- mm/oom_kill.c.orig Sun Aug 16 17:26:30 1998
+++ mm/oom_kill.c Sun Aug 16 18:24:05 1998
@@ -0,0 +1,133 @@
+/*
+ * linux/mm/oom_kill.c
+ *
+ * Copyright (C) 1998 Rik van Riel
+ *
+ * The routines in this file are used to kill a process when
+ * we're seriously out of memory. This gets called from kswapd()
+ * in linux/mm/vmscan.c when we really run out of memory.
+ *
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/stddef.h>
+#include <linux/swap.h>
+#include <linux/swapctl.h>
+#include <linux/timex.h>
+
+#define DEBUG
+/* Hmm, I remember a global declaration. Haven't found
+ * it though... */
+#define min(a,b) (((a)<(b))?(a):(b))
+
+typedef struct vm_kill_t
+{
+ unsigned int ram;
+ unsigned int total;
+} vm_kill_t;
+
+struct vm_kill_t vm_kill = {25, 3};
+
+inline int int_sqrt(unsigned int x)
+{
+ int out = x;
+ while (x & ~(unsigned int)1) x >>=2, out >>=1;
+ if (x) out -= out >> 2;
+ return (out ? out : 1);
+}
+
+/*
+ * Basically, points = size / (sqrt(CPU_used) * sqrt(sqrt(time_running)))
+ * with some bonuses/penalties.
+ *
+ * This is ugly as hell, and a nice cleanup is welcome :-)
+ */
+
+inline int badness(struct task_struct *p)
+{
+ int points = p->mm->total_vm;
+ points /= int_sqrt((p->times.tms_utime + p->times.tms_stime) >> (SHIFT_HZ + 3));
+ points /= int_sqrt(int_sqrt((jiffies - p->start_time) >> (SHIFT_HZ + 10)));
+ if (p->priority < DEF_PRIORITY)
+ points <<= 1;
+ if (p->uid == 0 || p->euid == 0 || p->cap_effective.cap & CAP_TO_MASK(CAP_SYS_ADMIN))
+ points >>= 2;
+ if (p->start_time < jiffies >> 6)
+ points >>= 2;
+/*
+ * NEVER, EVER kill a process with direct hardware access. If
+ * we start doing that, we won't make a clean recovery and a
+ * sync + umount + reboot will be better.
+ */
+ if (p->cap_effective.cap & CAP_TO_MASK(CAP_SYS_RAWIO)
+#ifdef __i386__
+ || p->tss.bitmap == offsetof(struct thread_struct, io_bitmap)
+#endif
+ )
+ points = 0;
+#ifdef DEBUG
+ printk(KERN_DEBUG "OOMkill: task %d (%s) got %d points\n",
+ p->pid, p->comm, points);
+#endif
+ return points;
+}
+
+inline struct task_struct * select_bad_process(void)
+{
+ int points = 0;
+ struct task_struct *p = NULL;
+ struct task_struct *chosen = NULL;
+ read_lock(&tasklist_lock); /* We might need this on SMP */
+ for_each_task(p)
+ if (p->pid && badness(p) > points)
+ chosen = p, points = badness(p);
+ read_unlock(&tasklist_lock);
+ return chosen;
+}
+
+/*
+ * The SCHED_FIFO magic should make sure that the killed context
+ * gets absolute priority when killing itself. This should prevent
+ * a looping kswapd from interfering with the process killing.
+ */
+void oom_kill(void)
+{
+
+ struct task_struct *p = select_bad_process();
+ if (p == NULL)
+ return;
+ printk(KERN_ERR "Out of Memory: Killed process %d (%s).\n", p->pid, p->comm);
+ force_sig(SIGKILL, p);
+ p->policy = SCHED_FIFO;
+ p->rt_priority = 1000;
+ current->policy |= SCHED_YIELD;
+ schedule();
+ return;
+}
+
+/*
+ * Are we out of memory?
+ *
+ * We ignore swap cache pages and simplify the situation a bit.
+ * This probably won't hurt, because when kswapd is failing we
+ * already have to assume the worst.
+ */
+
+int out_of_memory(void)
+{
+ struct sysinfo val;
+ int free_vm, kill_limit;
+ si_meminfo(&val);
+ si_swapinfo(&val);
+ kill_limit = min(vm_kill.ram * (val.totalram >> PAGE_SHIFT),
+ vm_kill.total * ((val.totalram + val.totalswap) >> PAGE_SHIFT));
+ free_vm = ((val.freeram + val.bufferram + val.freeswap) >>
+ PAGE_SHIFT) + page_cache_size - (page_cache.min_percent +
+ buffer_mem.min_percent) * num_physpages;
+ if (free_vm * 100 < kill_limit)
+ return 1;
+ return 0;
+}
+
+
\ No newline at end of file
--- mm/vmscan.c.orig Sun Aug 16 17:26:20 1998
+++ mm/vmscan.c Sun Aug 16 18:26:28 1998
@@ -28,6 +28,13 @@
#include <asm/bitops.h>
#include <asm/pgtable.h>

+/*
+ * OOM kill declarations. Move to .h file before submission :)
+ */
+
+extern int out_of_memory(void);
+extern void oom_kill(void);
+
/*
* When are we next due for a page scan?
*/
@@ -532,7 +539,7 @@
init_swap_timer();
add_wait_queue(&kswapd_wait, &wait);
while (1) {
- int tries;
+ int tries, tried, succes;

current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
@@ -558,14 +565,16 @@
*/
tries = pager_daemon.tries_base;
tries >>= 4*free_memory_available();
-
+ tried = succes = 0;
+
while (tries--) {
int gfp_mask;

- if (free_memory_available() > 1)
+ if (free_memory_available() > 1 && ++tried > pager_daemon.tries_min)
break;
gfp_mask = __GFP_IO;
- do_try_to_free_page(gfp_mask);
+ if (do_try_to_free_page(gfp_mask))
+ succes++;
/*
* Syncing large chunks is faster than swapping
* synchronously (less head movement). -- Rik.
@@ -574,6 +583,8 @@
run_task_queue(&tq_disk);

}
+ if (succes < 4 * tried && out_of_memory())
+ oom_kill();
}
/* As if we could ever get here - maybe we want to make this killable */
remove_wait_queue(&kswapd_wait, &wait);
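
For anyone who wants to play with the thresholds outside the kernel,
the test in out_of_memory() boils down to the arithmetic below. This
is a simplified userspace model, not the kernel code: struct
vm_snapshot, min_ul() and below_kill_limit() are made-up names, all
values are already in pages, and the page cache and buffer minimum
terms are left out.

```c
/* Hypothetical snapshot of the numbers out_of_memory() works from. */
struct vm_snapshot {
    unsigned long total_ram_pages;
    unsigned long total_swap_pages;
    unsigned long free_pages;   /* free RAM + buffers + free swap */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/*
 * Returns 1 when free memory has dropped below the smaller of
 * ram_pct percent of RAM and total_pct percent of RAM + swap.
 * The patch uses vm_kill = {25, 3} for these two percentages.
 */
static int below_kill_limit(const struct vm_snapshot *s,
                            unsigned long ram_pct,
                            unsigned long total_pct)
{
    unsigned long kill_limit =
        min_ul(ram_pct * s->total_ram_pages,
               total_pct * (s->total_ram_pages + s->total_swap_pages));

    /* Compare in units of "percent * pages" to avoid a division. */
    return s->free_pages * 100 < kill_limit;
}
```

On a box with 16384 pages of RAM and 32768 pages of swap, the 3%
figure is the smaller one, so with the default values the check
fires once fewer than about 1475 pages are free.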
