vm ideas

RHS Linux User (kernel@kvack.org)
Mon, 10 Nov 1997 02:27:46 -0500 (EST)


On Mon, 10 Nov 1997, Rik van Riel wrote:
...
> Another large kswapd 'bug' is that it doesn't do aging on mmapped
> files (see shrink_mmap) and that it uses too much CPU time
> (try_to_free_page IS run under a kernel lock!!).
> I fixed those two things this spring for version 2.1.42, but
> unfortunately this patch:
> - depended on the 2.1-mem-mgt memory wait-queue patch (a good
> piece of code, but it sometimes broke under _extreme_ load)
> - got erased when I upgraded after the summer holydays :-(

I did something similar against 2.1.47 or so - wanted to be able to run X
on my 8 meg machine at the time. But it had a somewhat larger goal: do
things in a mem_map oriented fashion instead of the old page table and
mem_map way. It showed promising signs: despite somewhat high overhead for
the pte linking, it worked beautifully on my 486 (it was actually reliable
under the heavy X and everything over NFS load, with swap on a slow IDE drive).
Well, I got busy and there wasn't much feedback, so I let it slip. (I
called it page-centric mm or somesuch at the time.)

> It basically consisted of 3 things:
> - implementation of balancing buffer and cache memory: if
> more than x% of memory is buffer/cache, we take that memory
> first, if less than y% of memory is buffer/cache, we leave
> it alone. (50% and 10% seemed to be good default values...)

Ummm, in my experience, setting any sort of limit on the buffer cache/page
cache is a bad idea. Off the top, news servers tend to lose big. If
aging is working correctly, all will balance properly anyways.
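
For reference, the x%/y% rule being discussed can be written down in a few
lines. This is only a sketch of the quoted proposal, not code from either
patch: the function name and the "either" middle case are assumptions, and
the 50%/10% defaults are the values Rik quotes.

```python
def reclaim_source(cache_pages, total_pages, high=0.50, low=0.10):
    """Say where to reclaim from first under the quoted x%/y% rule.

    Hypothetical helper: 'high' is x, 'low' is y from the quoted text.
    """
    frac = cache_pages / total_pages
    if frac > high:
        return "cache"      # buffer/cache is oversized: take from it first
    if frac < low:
        return "process"    # buffer/cache is small: leave it alone
    return "either"         # assumption: in between, let page aging decide


print(reclaim_source(60, 100))  # cache-heavy machine
print(reclaim_source(5, 100))   # cache already squeezed
```

The objection above is about the two hard branches: a news server wants far
more than 50% of memory as cache, and a fixed cutoff would take that away
even when aging says those pages are the hot ones.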

> - a new kernel thread, vhand, which used the linear table to
> (efficiently!) scan and age memory. It also tuned itself
> to make sure that a 'good' fraction of the pages had age 0.
> (tunable by sysctl)

That's exactly what I did. This had a noticeable effect: once swapping
began, pages were no longer being selected at random. Also, the horrible
skewing for shared pages was negated. (Shared pages tend to stick around
too long under the current scheme, as it can be a long time before the
mappings for them are walked over.) The double writing of shared mmappings
also went away.
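
A rough model of that kind of linear aging scan, for anyone who wants to
play with the idea. It is a sketch under assumptions, not the vhand patch
or my own: the Page class, the age constants, the 25% target and the
interval bounds are all made up for illustration.

```python
from dataclasses import dataclass

MAX_AGE = 20       # hypothetical cap on a page's age
AGE_ADVANCE = 3    # hypothetical bump for a referenced page
AGE_DECLINE = 1    # hypothetical decay for an unreferenced page


@dataclass
class Page:
    age: int = 0
    referenced: bool = False


def age_pass(mem_map):
    """One linear scan over the page table: bump referenced pages, decay
    the rest. Returns the fraction of pages now at age 0."""
    for page in mem_map:
        if page.referenced:
            page.age = min(page.age + AGE_ADVANCE, MAX_AGE)
            page.referenced = False   # clear so the next pass sees new use
        else:
            page.age = max(page.age - AGE_DECLINE, 0)
    return sum(p.age == 0 for p in mem_map) / len(mem_map)


def next_scan_interval(interval, zero_fraction, target=0.25):
    """Self-tuning: scan more often when too few pages have reached age 0,
    less often otherwise. Bounds are arbitrary illustration values."""
    if zero_fraction < target:
        return max(interval / 2, 0.125)
    return min(interval * 2, 8.0)
```

Because the scan walks physical pages rather than each process's page
tables, a shared page is aged once per pass no matter how many mappings it
has, which is where the skew goes away.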

> - shrink_mmap in mmap.c also used page-aging... this gave
> quite some improvement, especially under higher loads.
>
> ... I haven't taken a look at shared memory yet, and I
> don't know what effect the usage of aging could have
> on shm-performance. (since these pages are used by
> more programs, we should look even more carefully at
> these???)

Same thing as shared pages - some will be frequently used, others rarely.
Possibly using a formula based on frequency of use and time since last
use could improve things.
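
One possible shape for such a formula, purely as an illustration: decay the
use count exponentially with idle time. Nothing here has been measured; the
function names and the half-life of 8 ticks are assumptions.

```python
def keep_score(use_count, idle_ticks, half_life=8.0):
    """Higher score = more worth keeping. The use count is halved for
    every 'half_life' ticks the page has sat unused (assumed formula)."""
    return use_count * 0.5 ** (idle_ticks / half_life)


def pick_victim(pages):
    """pages maps a page id to (use_count, idle_ticks); returns the id
    with the lowest score, i.e. the first candidate to swap out."""
    return min(pages, key=lambda pid: keep_score(*pages[pid]))


pages = {"a": (10, 16), "b": (3, 0)}
print(pick_victim(pages))
```

With these numbers the heavily used but long-idle page scores below the
lightly used but fresh one, so frequency alone does not protect a page
forever.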

> ... We could improve fairness (and performance) by taking
> pages from programs on a faults/megabyte or other ratio,
> currently we just take pages from whatever program we
> happen to at. (would this be worth the effort / cost)

Better would be to disable/suspend one process when several cause
thrashing. This would extend into scheduling and probably be a lot of
work.
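
The selection step itself is the easy part. A sketch, assuming we had
per-process major fault rates to look at; the function name, the threshold
and the "suspend the heaviest faulter" policy are all assumptions, and the
scheduler side is not modelled at all.

```python
def pick_process_to_suspend(fault_rates, thrash_threshold):
    """fault_rates maps pid -> major faults per second.

    Returns the pid to suspend, or None if the system is not considered
    to be thrashing. Assumed policy: stop the heaviest faulter."""
    if len(fault_rates) < 2:
        return None     # a single process cannot be fighting anyone
    if sum(fault_rates.values()) < thrash_threshold:
        return None     # total fault load is still tolerable
    return max(fault_rates, key=fault_rates.get)
```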

> ... Would making an inactive list (pages to swap out next),
> so kswapd doesn't have to do it's inefficient scanning, be
> worth it? (probably not?)

In fact it would help. By speculatively writing out pages ahead of time
it becomes possible to reclaim these pages without sleeping or blocking -
even in interrupts. This gives GFP_ATOMIC allocations for skbuffs and
other interrupt driven memory needs a greater chance of succeeding.
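
To make the point concrete, here is a toy model of an inactive list split
into clean and dirty halves. It is a sketch only: the class and method
names are invented, and the "write" is a placeholder where real I/O (and
sleeping) would happen.

```python
from collections import deque


class InactiveList:
    """Pages queued for eviction, oldest first (hypothetical structure)."""

    def __init__(self):
        self.clean = deque()   # already on disk: can be dropped instantly
        self.dirty = deque()   # must be written out before reuse

    def deactivate(self, page_id, dirty):
        (self.dirty if dirty else self.clean).append(page_id)

    def writeback(self, budget):
        """Background context, allowed to sleep on I/O. Speculatively
        cleans up to 'budget' pages and returns how many it wrote."""
        written = 0
        while self.dirty and written < budget:
            page_id = self.dirty.popleft()
            # real disk write would go here
            self.clean.append(page_id)
            written += 1
        return written

    def reclaim_atomic(self):
        """Interrupt-like context: never waits. Returns a page id, or
        None if nothing is clean (the atomic allocation would fail)."""
        return self.clean.popleft() if self.clean else None
```

The only way reclaim_atomic() succeeds is if writeback() ran earlier, which
is the whole argument for writing pages out ahead of time.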

> ... Would implementing RSS_LIMIT for users/programs be
> good for system performance?? Freeing up memory for (small,
> interactive) programs of other users might be good, but not
> if we thrash the I/O system by continuously swapping some
> big (background) number crunching application.
> ... KEYWORD: balancing. which is worth it, which isn't?
> other keywords: small kernel, fast kernel.

I remember someone mentioning that kswapd as designed was really meant for
systems with 4-12 megs of RAM. Most of the ideas proposed here will work
quite well on most systems, but not on low memory machines (especially
speculative paging). But recent systems start at 16 megs, with 32 being
the norm, so this should be the target. More people seem to be using
Linux on large memory systems too, and they shouldn't be left out in the
cold. (someone posted recently about a machine with 320 megs of RAM -
ouch!)

> I'm now asking all of you what features should be implemented
> first. Also, the wait-queue for memory was a very good piece
> of code. If someone would care porting it to the newer kernels,
> we'd all live in a happier world.

The 'type' of a page allocation should be taken into account to help
reduce fragmentation. Eg: long term slab pages vs short term network
buffers vs relocatable user pages.
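
A toy illustration of what I mean, assuming memory is carved into
fixed-size blocks and each block is claimed by one allocation type. The
names and the three-way split are just for the example, and there is
deliberately no fallback when a type runs out of blocks, which a real
allocator would need.

```python
from enum import Enum


class Kind(Enum):
    LONG_TERM = "long term (slab)"
    SHORT_TERM = "short term (network buffers)"
    MOVABLE = "relocatable (user pages)"


class GroupedAllocator:
    """Hands out pages so that each block only holds one Kind."""

    def __init__(self, nblocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.owner = [None] * nblocks   # Kind that claimed each block
        self.used = [0] * nblocks       # pages handed out per block

    def alloc(self, kind):
        """Return the index of the block the page came from, or None."""
        # Prefer a block this kind already owns that still has room.
        for i, owner in enumerate(self.owner):
            if owner is kind and self.used[i] < self.pages_per_block:
                self.used[i] += 1
                return i
        # Otherwise claim a block nobody owns yet.
        for i, owner in enumerate(self.owner):
            if owner is None:
                self.owner[i] = kind
                self.used[i] = 1
                return i
        return None   # no fallback in this sketch
```

Keeping long term pages out of the blocks used for short term and
relocatable ones means that when the short-lived allocations go away, whole
blocks come free instead of being pinned by one stray slab page.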

>
> Rik.
>
> ps. the From: header doesn't yet correct, I will have to get
> sendmail working first :-)
> pps. if you think this posting should be forwarded to the
> kernel list or to other core-mm people, please do so.

Does a mailing list exist for people doing mm work? If so, I'd like to
hear of it - mm development seems neglected in 2.1 (except for the
addition of the SLAB), and (IMHO) needs some more work before 2.2. I
cringe every time someone reports a 'discovery' of the sound 'Unable to
allocate DMA buffer' bug. And the network stack's handling of fragmented
packets must be improved. (and NFS replaced....)

-ben