Swap problem with pre-2.0.31

Rauli Ruohonen (raulir@fishy.pp.sci.fi)
Tue, 10 Jun 1997 18:53:18 +0300 (EET DST)

I was again poking around with pre-2.0.31 (the first one, with all the patches
released before the second pre-2.0.31), and again sent binary data to named,
causing it to grow to ~6500 KB. Then I ^C'd the binary
sender program ("stressport localhost 53 -binary"). The HD was doing
something (not much), and the system slowed to a crawl.

I tried to kill named with "killall -9 named", but since the first letter
took 10 seconds to appear in the shell (running at realtime priority 99),
I went to the kitchen and ate a little :) When I returned, I saw this on
the screen: "try_to_free_page: state (3) stop (1847308) i(0) sleep
instead of fail". The values are genuine, since the printk parameters are
correct in this kernel. The same message flooded the screen; by pressing
ALT+ScrollLock I caught only a glimpse of the output. VC switching worked,
but nothing else did (not even Ctrl-Alt-Del), and the HD was silent. Time for a HW reset.

I managed to capture this top output before the system crashed (named was ~4MB):

5:35pm up 1:16, 3 users, load average: 2.74, 2.37, 1.64
60 processes: 57 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 2.0% user, 15.3% system, 16.1% nice, 82.9% idle
Mem: 18712K av, 18552K used, 160K free, 456K shrd, 164K buff
Swap: 31184K av, 12300K used, 18884K free 344K cached

The 3 running processes were top, stressport and named. Nothing else was
running except a few shells, init and the like.

I have only been able to reproduce this with named. I tried writing a
memory-eater program, but it didn't have the same effect, and the newest
version of named seems to have been fixed so that it no longer eats memory
when it receives binary data. Yes, my version of named is buggy, but
it shouldn't be able to crash the kernel this way.

I patched the kernel with David's patch (no, this isn't a 386 with 4MB
of memory but a 486 DX2 with 20MB of memory).
The behavior changed: now the system crawls only while stressport
is running and named is larger than 4MB. When I ^C stressport, named is almost
completely swapped out and the system stops swapping. Killing named still
takes some time, but it actually gets killed and everything works
normally after that.

It looks like named is always swapped out right after it's swapped in, so
its execution becomes _REALLY_ slow and the system is constantly swapping.
I'm no kernel expert though, so this is just speculation :)
And no, this isn't the normal slowdown caused by swapping. I've never
seen a slowdown this big when only a few megs get paged, so please don't just mail
me and say "that's normal, when you use the HD as memory it's always slow".
(Here, slowdown means top draws one line per minute and the shell echoes
characters at 0.01 cps.)