Re: mmap() versus read()

Larry McVoy (lm@who.net)
Sun, 08 Mar 1998 18:14:53 -0800


: I just perused the solaris madvise() man page and I don't think it would
: help apache in the long run. Specifically:
:
: MADV_SEQUENTIAL
: Tells the system that addresses in this range are
: likely to be accessed only once, so the system will
: free the resources mapping the address range as quickly
: as possible.

If I remember correctly from my SunOS kernel hacking days, this call
doesn't do much. The original idea was that the madvise would turn
accesses to the file into a sliding window, such that the pages behind
the current page-in request would get paged out. This is almost certainly
not what you want for apache.

Anyway, this call never worked very well in SunOS and I doubt they have
fixed it in Solaris. I'd be interested in hearing if someone can come
up with a test program that shows different behavior with and w/o the
madvise.
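
A minimal sketch of what such a test could look like, assuming a POSIX
system with mmap() and madvise(). The helper name scan_file() and its
parameters are made up for illustration. It maps a file, optionally
issues MADV_SEQUENTIAL, touches one byte per page, and returns the
elapsed time. To see any difference, the file would probably need to be
larger than free memory, and the page cache state should be comparable
between runs.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes madvise() under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: map `path`, optionally advise sequential access,
 * touch one byte per page in order, and return the elapsed seconds
 * (-1.0 on error). The byte sum is stored in *sum_out if non-NULL. */
double scan_file(const char *path, int use_madvise, unsigned long *sum_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1.0;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1.0;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1.0;

    if (use_madvise)
        (void)madvise(p, len, MADV_SEQUENTIAL);   /* advisory only */

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    struct timespec t0, t1;
    unsigned long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t off = 0; off < len; off += (size_t)page)
        sum += p[off];              /* forces a page-in for each page */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    munmap(p, len);
    if (sum_out)
        *sum_out = sum;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling scan_file(path, 0, NULL) and scan_file(path, 1, NULL) on the same
large file and comparing the times (and free memory afterwards) would be
one way to check whether the madvise changes anything.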

I made the call largely redundant, in UFS, by putting in automatic
free behind for sequential accesses. This was a huge hack and a huge
win. The idea was this:

. the file system keeps track of sequential access in the inode
(almost all file systems do this for read ahead).

. the file system "knows" when the VM system is going to start
being starved for memory and start paging stuff out.

. on large (>256K 6 years ago, probably .5M or 1M today) files
that are being accessed sequentially, the file system looks at
free memory on each page-in request. If free memory is close
(within about 1M or so) to being all used up, the file system looks
backwards a "window's" worth (say .25M) and frees up the pages
behind.

The effect is that memory fills up with whatever and then when large I/O
happens, the OS just slides a .5M window through the file.
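
The heuristic above can be sketched roughly as follows. This is not the
UFS code. The struct, the function name free_behind() and the constants
are all hypothetical, with the thresholds taken loosely from the numbers
quoted above. The function only decides which byte range behind the
current request would be freed. Actually releasing pages is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative thresholds, loosely based on the figures in the post. */
#define BIG_FILE_BYTES (512u * 1024)    /* "large" file cutoff */
#define LOW_MEM_BYTES  (1024u * 1024)   /* "within about 1M" of empty */
#define WINDOW_BYTES   (256u * 1024)    /* window to free behind */

struct fb_inode {            /* hypothetical per-file state */
    size_t size;             /* file size in bytes */
    size_t next_expected;    /* offset just past the previous request */
};

struct fb_range {            /* range to free; len == 0 means nothing */
    size_t start;
    size_t len;
};

/* Called on each page-in request for [off, off + len). */
struct fb_range free_behind(struct fb_inode *ip, size_t off, size_t len,
                            size_t free_mem)
{
    struct fb_range r = { 0, 0 };
    assert(ip != NULL);

    /* Sequential means this request starts where the last one ended. */
    int sequential = (off == ip->next_expected);
    ip->next_expected = off + len;

    /* Only act on large, sequentially read files when memory is tight. */
    if (!sequential || ip->size <= BIG_FILE_BYTES || free_mem > LOW_MEM_BYTES)
        return r;

    /* Look back one window's worth, clamped at the start of the file. */
    r.start = off > WINDOW_BYTES ? off - WINDOW_BYTES : 0;
    r.len = off - r.start;
    return r;
}
```

With these numbers, a sequential request at offset 1M on a 4M file with
512K of free memory would yield the range [768K, 1M) as the candidate to
free.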

A refinement would be to put the file in "free behind" mode, the way you
put it in read ahead mode. It's pretty worthless to have only part of a
file in memory, so if you start pushing it out, you should probably push
all of it out, not just part of it.

If someone is interested in talking more about this, mail me or post
to this thread.

: In apache-1.3 we still use the multiple process model, and so an mmap() is
: only in use by a single process at a time. If multiple requests are
: active on the same file there are multiple mmap()s for the file. This is
: a bit of a waste...

It's a waste in VM data structures only. And that could be fixed by
using cloned processes that share VM.

: multithreaded. In this case madvise() is perfect because each task gets
: to say "I'm using it sequentially" and then do something sequentially

The call, in my opinion, is backwards. The default behavior should be
(and almost always is) to assume sequential access. The OS
always reads ahead for you anyway. It's the random case that you
want to warn the OS about - to tell it to skip the read ahead.
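
For the random case, the hint that exists in madvise() today is
MADV_RANDOM. A small sketch, assuming a POSIX-like system. The helper
name map_random() is made up for illustration. Whether the kernel
actually reduces read ahead in response is system-dependent.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes madvise() under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only and hint random access.
 * Returns the mapping (length in *len_out), or NULL on failure. */
void *map_random(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    /* Advisory only: the kernel may ignore it, so failure is not fatal. */
    (void)madvise(p, len, MADV_RANDOM);

    *len_out = len;
    return p;
}
```

The caller is expected to munmap() the returned region when done. An
mmap()d database doing keyed lookups would be the kind of user that
might want this hint.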

: Note that even if linux gets a sendfile/TransmitFile syscall the mmap()
: case is still important -- specifically for mmap()d databases. We're
: designing a copy-avoiding I/O layer for apache-2.0 and support for
: non-filesystem objects is on the list; as is support for sendfile... it's
: all vapourware at the moment though.

When you guys get around to this, please talk to me about it. I have done
some work in this area and can share some data.

--m

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu