Re: Howto prevent kernel from evicting code pages ever? (to avoid disk thrashing when about to run out of RAM)

From: Marcus Linsner
Date: Mon Sep 10 2018 - 05:18:25 EST


On Wed, Aug 22, 2018 at 11:25 AM Marcus Linsner
<constantoverride@xxxxxxxxx> wrote:
>
> Hi. How do I make the kernel keep (lock?) all code pages in RAM so that
> kswapd0 won't evict them when the system is under low-memory
> conditions?
>
> The purpose of this is to prevent the kernel from causing lots of disk
> reads (effectively freezing the whole system) when it is about to run out
> of RAM, even when there is no swap enabled, but well before (in real-time
> minutes) the OOM-killer triggers to kill the offending process (e.g. ld)!
>
> I can replicate this consistently with 4G (and 12G) max RAM inside a
> Qubes OS R4.0 AppVM running Fedora 28 while trying to compile Firefox.
> The disk thrashing (continuous 192+ MiB/sec reads) occurs well before
> the OOM-killer triggers to kill the 'ld' (or 'rustc') process, and
> everything is frozen for (real-time) minutes. I've also encountered
> this on bare metal, if it matters at all.
>
> I tried to ask this question on SO here:
> https://stackoverflow.com/q/51927528/10239615
> but maybe I'll have better luck on this mailing list, where the kernel experts are.
>
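
Before the kernel-side approach below, one aside: for a single critical
process, eviction of its own code pages can already be avoided purely from
userspace by locking its mappings with mlockall(). That doesn't help
system-wide (ld, rustc and everything else would each have to opt in), which
is why I went for a kernel change instead. A minimal sketch of the userspace
approach, untested here:

/* Hypothetical per-process mitigation, NOT part of the patch below:
 * lock all current and future mappings of this process into RAM so its
 * code pages cannot be reclaimed. Needs CAP_IPC_LOCK or a large enough
 * RLIMIT_MEMLOCK.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                perror("mlockall");
                return 1;
        }
        /* ... real work goes here; all touched pages stay resident ... */
        return 0;
}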

This is what I have working so far to prevent the disk thrashing
(constant re-reading of active executable pages from disk) that would
otherwise freeze the OS long before the OOM-killer triggers:

The following patch can also be seen here:
https://github.com/constantoverride/qubes-linux-kernel/blob/devel-4.18/patches.addon/le9d.patch

Revision 3: preliminary patch to avoid disk thrashing (constant reading)
under memory pressure, before the OOM-killer triggers.
More info: https://gist.github.com/constantoverride/84eba764f487049ed642eb2111a20830

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..7636498 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -208,7 +208,7 @@ enum lru_list {
 
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_INACTIVE_FILE; lru++)
 
 static inline int is_file_lru(enum lru_list lru)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 03822f8..1f3ffb5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2086,9 +2086,9 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                                  struct scan_control *sc)
 {
         if (is_active_lru(lru)) {
-                if (inactive_list_is_low(lruvec, is_file_lru(lru),
-                                         memcg, sc, true))
-                        shrink_active_list(nr_to_scan, lruvec, sc, lru);
+                //if (inactive_list_is_low(lruvec, is_file_lru(lru),
+                //                         memcg, sc, true))
+                //        shrink_active_list(nr_to_scan, lruvec, sc, lru);
                 return 0;
         }
 
@@ -2234,7 +2234,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
         anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, MAX_NR_ZONES) +
                 lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, MAX_NR_ZONES);
-        file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
+        file  = //lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES) +
                 lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES);
 
         spin_lock_irq(&pgdat->lru_lock);
@@ -2345,7 +2345,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
                          sc->priority == DEF_PRIORITY);
 
         blk_start_plug(&plug);
-        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
+        while (nr[LRU_INACTIVE_ANON] || //nr[LRU_ACTIVE_FILE] ||
                                         nr[LRU_INACTIVE_FILE]) {
                 unsigned long nr_anon, nr_file, percentage;
                 unsigned long nr_scanned;
@@ -2372,7 +2372,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
                  * stop reclaiming one LRU and reduce the amount scanning
                  * proportional to the original scan target.
                  */
-                nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
+                nr_file = nr[LRU_INACTIVE_FILE] //+ nr[LRU_ACTIVE_FILE]
+                        ;
                 nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
 
                 /*
@@ -2391,7 +2392,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
                         percentage = nr_anon * 100 / scan_target;
                 } else {
                         unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
-                                                targets[LRU_ACTIVE_FILE] + 1;
+                                                //targets[LRU_ACTIVE_FILE] +
+                                                1;
                         lru = LRU_FILE;
                         percentage = nr_file * 100 / scan_target;
                 }
@@ -2409,10 +2411,12 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
                 nr[lru] = targets[lru] * (100 - percentage) / 100;
                 nr[lru] -= min(nr[lru], nr_scanned);
 
+                if (LRU_FILE != lru) { //avoid this block for LRU_ACTIVE_FILE
                 lru += LRU_ACTIVE;
                 nr_scanned = targets[lru] - nr[lru];
                 nr[lru] = targets[lru] * (100 - percentage) / 100;
                 nr[lru] -= min(nr[lru], nr_scanned);
+                }
 
                 scan_adjusted = true;
         }


Tested on kernel 4.18.5 under Qubes OS, in both dom0 and VMs. It gets
rid of the disk thrashing that would otherwise seemingly permanently
freeze a qube (VM) with continuous disk reading (as seen from dom0 via
sudo iotop). With the above, the system freezes for at most 1 second
before the OOM-killer triggers and frees RAM by killing some process.
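
For a quicker way to push a VM into this state than a full Firefox
build, a trivial memory hog that keeps touching new pages until the
OOM-killer fires should reproduce the same kind of pressure; a rough
sketch (the 64 MiB step size is arbitrary):

/* Rough memory-pressure reproducer: keep allocating and touching
 * 64 MiB chunks until the OOM-killer kills us (or malloc fails).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        size_t step = 64UL << 20;       /* 64 MiB per iteration */
        size_t total = 0;

        for (;;) {
                char *p = malloc(step);
                if (!p) {
                        fprintf(stderr, "malloc failed after %zu MiB\n",
                                total >> 20);
                        break;
                }
                memset(p, 0x5a, step);  /* actually touch every page */
                total += step;
                printf("allocated %zu MiB\n", total >> 20);
        }
        return 0;
}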

If anyone has a better idea, please let me know. I am hoping someone
knowledgeable can step in :)

I tried to find a way to also keep Inactive file pages in RAM, just
for testing(!), but couldn't figure out how (I'm not a programmer).
So keeping just the Active file pages seems good enough for now, even
though I can clearly see (via vm.block_dump=1) that some pages are
still being re-read under high memory pressure; for some reason they
don't cause any (or much) disk thrashing.
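
For anyone who wants to watch how the two file lists behave while the
pressure builds, polling the Active(file)/Inactive(file) counters in
/proc/meminfo once a second is enough; a quick sketch of the kind of
watcher I mean:

/* Print the Active(file) and Inactive(file) lines from /proc/meminfo
 * once per second, to watch the file LRU lists shrink under pressure.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char line[256];

        for (;;) {
                FILE *f = fopen("/proc/meminfo", "r");
                if (!f) {
                        perror("fopen /proc/meminfo");
                        return 1;
                }
                while (fgets(line, sizeof(line), f)) {
                        if (strstr(line, "Active(file)") ||
                            strstr(line, "Inactive(file)"))
                                fputs(line, stdout);
                }
                fclose(f);
                puts("---");
                sleep(1);
        }
}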

Cheers!