Re: 2.1.78: mm and networking questions...

Linus Torvalds (torvalds@transmeta.com)
8 Jan 1998 17:54:38 GMT


In article <199801080258.UAA02216@jadrek.kwr>, <kwrohrer@enteract.com> wrote:
>
>(1) There's just one "struct page" per physical page? And there's
> an array "mem_map" of these, indexed redundantly by
> MAP_NR(address) and by struct page::map_nr?

Yes. Although it's not really redundant per se, it's an optimization.
page->map_nr can be calculated from the page pointer ("page - mem_map" -
we used to do it that way), but that involves a division by a number
that is not a power of two, and it actually showed up very clearly as a
performance problem in some code.
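
To make the trade-off concrete, here is a minimal illustration (not the
actual kernel helpers) - the pointer subtraction hides a division by
sizeof(struct page), which is not a power of two, while the cached
field is a single load:

    /* Illustrative only - not the real kernel code. */
    #include <linux/mm.h>

    unsigned long map_nr_by_arithmetic(struct page *page)
    {
        return page - mem_map;   /* hidden divide by sizeof(struct page) */
    }

    unsigned long map_nr_by_field(struct page *page)
    {
        return page->map_nr;     /* just one memory load */
    }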

>(2) What on earth is "mem_map_t" doing, and why should this alias for
> "struct page" exist?

Historical reason. "mem_map_t" used to be just a "unsigned short", and
contained only the page count. Then it became a structure that contained
the page count and the "reserved bit", and finally it became the current
"struct page". The old name exists so that I didn't have to do too much
of a search-and-replace when I did the changes.

>(3) Would performance suffer horribly if the struct page were to have
> a more even (14 or 16) number of words in it, or would we get
> back performance by making the cache line boundaries fall in the
> right places?

It would probably be OK, and if the size of the structure were
guaranteed to be a power of two (on _all_ architectures - remember the
64-bit issues), then we could drop the map_nr entry entirely, because
computing map_nr would be trivial pointer arithmetic and a shift.
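
As a sketch of what that would buy (the 32-byte size and the helper
name here are hypothetical, purely for illustration), the division
would collapse into a subtract and a shift, and the stored map_nr
could go away:

    /* Hypothetical: assumes sizeof(struct page) is padded to 32 bytes
     * on every architecture, so log2(sizeof(struct page)) == 5. */
    #define PAGE_STRUCT_SHIFT 5

    unsigned long map_nr_of(struct page *page)
    {
        return ((unsigned long)page - (unsigned long)mem_map)
                    >> PAGE_STRUCT_SHIFT;
    }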

>(4) Similarly to (1) I take it there's exactly one struct mm_struct per
> struct task_struct, and each of the struct vm_area_struct
> *mmap points to a chain of vma's unique to the task?

No. Several task_struct's can share the same mm_struct when you use
clone() (check out LinuxThreads, for example). But yes, it's a
many-to-one relationship: each task is associated with exactly one
mm_struct, while one mm_struct can be shared by several tasks - if
that was what you were after.

>(5) When we start to swap a page out to disk, if the process wants
> to write to that page, what happens? I can't find anything
> to prevent the access, nor can I find anything that would
> notice such an access, until the disk I/O completes and the
> page gets replaced or hits the swap cache...

We mark the page not present before starting the write to disk, and the
swap-out holds a per-page lock bit - so if something tries to access
the page (read or write) in the meantime, it takes a fault and the page
gets swapped in again, and the lock bit makes sure that these
operations are serialized on that particular page.
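
A much-simplified sketch of that ordering (swap_entry_for() and
write_page_to_swap() are made-up names for illustration, not the real
2.1 functions):

    void swap_out_one_page(struct vm_area_struct *vma, unsigned long addr,
                           pte_t *ptep, struct page *page)
    {
        set_bit(PG_locked, &page->flags);       /* per-page lock bit    */
        set_pte(ptep, swap_entry_for(page));    /* now "not present"    */
        flush_tlb_page(vma, addr);
        write_page_to_swap(page);               /* the actual disk I/O  */
        clear_bit(PG_locked, &page->flags);     /* I/O done, unlock     */
    }

Any fault on that pte goes down the swap-in path, which waits for
PG_locked to clear, so a read or write that races with the swap-out
simply blocks until the page is consistent again.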

>(6) Similarly, if I were to pte_mkold some "innocent" pages to
> encourage them to be copied out to disk, would there be
> a major penalty (besides perhaps a wasted disk write) involved
> if they were still in frequent use? If I'm right about (1)
> then artificial aging should encourage free areas of the
> desired size, without the need for a reverse page table...

That should work.
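
A rough sketch of that artificial aging (using the usual page-table
walk macros; locking and the vma lookup are omitted, so treat this as
illustrative only):

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    /* Clear the accessed bit of one user page so that vmscan's aging
     * treats it as idle and considers it for swap-out sooner. */
    static void age_one_page(struct mm_struct *mm, unsigned long address)
    {
        pgd_t *pgd = pgd_offset(mm, address);
        pmd_t *pmd;
        pte_t *pte;

        if (pgd_none(*pgd))
            return;
        pmd = pmd_offset(pgd, address);
        if (pmd_none(*pmd))
            return;
        pte = pte_offset(pmd, address);
        if (!pte_present(*pte))
            return;
        set_pte(pte, pte_mkold(*pte));
    }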

>(7) If we had a reverse page table, and could walk physical memory in
> search of stuff to swap out, might that lead to better balancing
> between different sorts of pages (e.g. process vs. buffer cache)?
> (Not to mention, a bit faster?)

Yes. We could try to do what we currently do in "vmscan.c", but without
walking the page tables by hand.

>(8) As reported by shift-ScrollLock, should the "IP fragment buffer size"
> really be 0 all the time? The machine's networks are all point-
> to-point with an always-defragment firewall, so it makes some
> sense...but alas all the unassembled fragment buffers in the
> world don't make for a single skbuff to stick them in.

It should be non-zero only if you have fragments that haven't been
re-assembled yet. And fragments are _extremely_ rare - TCP will never
create them, and UDP creates them only for large packets. In short,
fragments should almost never happen under normal load _except_ for NFS.

Even then, fragments should have a short life-time unless one or more of
the fragments get lost.

To see fragments, use NFS with an rsize/wsize larger than 1500 (the
Ethernet MTU), and use NFS _heavily_ while pressing shift-ScrollLock.
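
For example (the rsize/wsize values are only illustrative), an NFS
mount like

    mount -t nfs server:/export /mnt -o rsize=8192,wsize=8192

turns every NFS read and write into an 8KB UDP datagram, which has to
be split into roughly half a dozen IP fragments on a 1500-byte-MTU
Ethernet.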

>(9) Does paging in recent (2.1.>50 or so) kernels seem a lot slower than
> in the good old days? My swap disk can manage several megabytes
> per second, sustained, but e.g. the backdrop in X pages in at
> around an inch per second at times. Is /proc/sys/vm/freepages
> set too low (especially the latter values)? Or is there some
> other condition that's limiting page-in speed?

This might be a good thing to try to tweak.
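
For example (the numbers are purely illustrative, not a recommendation;
the three values are the min/low/high free-page thresholds):

    echo "128 256 384" > /proc/sys/vm/freepages

and then see whether the page-in rate improves under the same load.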

Linus