Ideas for reducing memory copying and zeroing times

Jamie Lokier (jamie@rebellion.co.uk)
Tue, 16 Apr 96 01:35 BST


Reading
=======

If all this memory-to-memory copying is taking a significant time, how
about the following optimisation: remap the pages using the MMU for
aligned, page-sized portions of the copied area. This would give
programs that use `read' some of the efficiency of `mmap'. If the C
library is tweaked to ensure that stdio buffers are page aligned, this
could be a really effective optimisation.
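
As a user-space illustration of the alignment requirement (the kernel
remapping itself is the hypothetical part -- memalign(), getpagesize()
and setvbuf() are ordinary libc calls), a program could hand stdio a
page-aligned, page-sized buffer something like this:

    /* Sketch: give stdio a page-aligned buffer so that, under the
     * proposal above, the kernel could remap page-cache pages into
     * it instead of copying them. */
    #include <stdio.h>
    #include <unistd.h>
    #include <malloc.h>

    int main(void)
    {
        size_t pg = getpagesize();
        char *buf = memalign(pg, 8 * pg);   /* page-aligned, 8 pages */

        if (!buf || setvbuf(stdin, buf, _IOFBF, 8 * pg) != 0)
            return 1;

        /* stdio now reads in aligned, page-sized chunks -- exactly
         * the case the MMU trick could turn into page remaps. */
        char line[256];
        while (fgets(line, sizeof line, stdin))
            fputs(line, stdout);
        return 0;
    }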

Of course, this is another way in which pages are shared and it might
just complicate the scheme of things a tad. ;-)

Writing
=======

`write' doesn't benefit in quite the same way. Assume that a page to be
written starts out zero-mapped (see below for zero-mapping ideas), is
filled with data, and then written. If this happens only once then it
is worth using the MMU to share the page with the page-cache. If the
page is filled again though, it has to be copied (as a copy-on-write
page) and all you have gained is that the I/O potentially got started
earlier. Of course, all writes (including NFS) will be delayed in
future anyway, won't they? :-)

Using the MMU for `write' might be worthwhile anyway, because there are
special circumstances when the copy can be avoided. Programs which read
and write about the same amount of data (e.g., file servers) tend to
read into the same areas they use for writing. Provided `read' is using
the MMU as well, there is no need for the process to copy the data it
wrote earlier unless the new write is shorter than the old read, and
even then at most a page's worth of data needs copying.
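
A minimal sketch of that file-server pattern, assuming fd_in and fd_out
are already-open descriptors; the user-space code is ordinary, and the
kernel-side remapping of the reused buffer is the proposal itself:

    /* One page-aligned buffer reused for both read and write: with
     * MMU-assisted `read' and `write' the kernel need never copy. */
    #include <unistd.h>
    #include <stdlib.h>
    #include <malloc.h>

    static int copy_stream(int fd_in, int fd_out)
    {
        size_t pg = getpagesize();
        char *buf = memalign(pg, 8 * pg);
        ssize_t n;

        if (!buf)
            return -1;
        while ((n = read(fd_in, buf, 8 * pg)) > 0) {  /* fills pages */
            if (write(fd_out, buf, n) != n) {         /* remaps them */
                free(buf);
                return -1;
            }
        }
        free(buf);
        return n < 0 ? -1 : 0;
    }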

If the program knows it isn't interested in the data it just wrote, it
could issue an alternative `write_and_zero' system call which remaps the
page and replaces it with a zero-mapped page. Real programs won't do
that, of course, because it isn't standard. But the stdio library could
do it (so a lot of programs would benefit), and some other software such
as file servers, `cat' and `dd' could be modified to make use of it.
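
To make the idea concrete, here is one way a buffer flush might use
such a call. `write_and_zero' and the HAVE_WRITE_AND_ZERO guard are
entirely hypothetical -- the guard is there precisely because the call
does not exist -- and the fallback is an ordinary `write':

    #include <unistd.h>
    #include <errno.h>

    static ssize_t flush_buffer(int fd, void *buf, size_t len)
    {
    #ifdef HAVE_WRITE_AND_ZERO
        /* Donate the buffer's pages to the page-cache and receive
         * zero-mapped pages in their place: no copy, and the buffer
         * needs no re-zeroing before reuse. */
        ssize_t n = write_and_zero(fd, buf, len);
        if (n >= 0 || errno != ENOSYS)
            return n;
    #endif
        return write(fd, buf, len);   /* ordinary copying write */
    }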

Zero-mapped pages
=================

Well, copy-on-write of zero-mapped pages obviously happens a great deal.
So it's worth writing the fastest page-zeroing code that anyone can
think up. (I haven't timed it, but it seems to me that even the
`memset' in <asm-i386/strings-i486.h> might go faster on a Pentium if
it were unrolled a little and used paired writes, simply because many
of the zeroes would then be written to the internal cache during the
loop and only flushed to the secondary cache later, while other code
happily runs out of the internal cache.)
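
For what it's worth, the kind of loop meant here might look like the
following; the function name is mine, and whether the stores actually
pair on a Pentium depends on what the compiler emits:

    #define PAGE_SIZE 4096

    /* Unrolled page zeroing: four independent stores per iteration,
     * intended to pair in the Pentium's U/V pipes. */
    static void zero_page(void *page)
    {
        unsigned long *p = page;
        unsigned long *end = p + PAGE_SIZE / sizeof *p;

        while (p < end) {
            p[0] = 0;
            p[1] = 0;
            p[2] = 0;
            p[3] = 0;
            p += 4;
        }
    }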

Apart from that though, how about having the idle task (or a
low-priority kernel thread) fill out a pool of pre-zeroed pages. When a
process needs a zero page, if there are any in the pool it can have one
immediately by remapping a page -- no copy on write required. Of
course, under constant load the pool would be empty so you still need
the fast zeroing code. At least at the start of a burst of activity
there would be a much reduced zeroing time (such as when a program
starts up and fills its data area). And with SMP, even when all but
one CPU is busy, the remaining one may have enough idle time to keep
the pool going for the others.
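
A user-space analogue of the pool, as a sketch: all the names are
illustrative, malloc() stands in for the kernel's page allocator, and
the lock a real SMP kernel would need around the list is elided:

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define POOL_MAX  32

    static void *pool[POOL_MAX];
    static int   pool_count;        /* needs a lock on SMP */

    /* Run from the idle loop: top the pool up a page at a time. */
    static void idle_refill(void)
    {
        while (pool_count < POOL_MAX) {
            void *page = malloc(PAGE_SIZE);
            if (!page)
                return;
            memset(page, 0, PAGE_SIZE);   /* zeroed while idle */
            pool[pool_count++] = page;
        }
    }

    /* Fast path: a ready-zeroed page if one is available, falling
     * back to zeroing on demand under sustained load. */
    static void *get_zeroed_page(void)
    {
        if (pool_count > 0)
            return pool[--pool_count];
        void *page = malloc(PAGE_SIZE);
        if (page)
            memset(page, 0, PAGE_SIZE);
        return page;
    }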

Network skbuffs
===============

Having implemented all of the above (you, not me :-), the icing on the
cake is then to have receive skbuffs allocated in such a way that the
data part of the packet from a device happens to have just the right
page alignment when it comes in... You get the idea. With this,
reading data over NFS with rsize=4k or rsize=8k into a process requires
absolutely no internal data copying at all! The data comes off the
Ethernet card (well, that bit requires some I/O or copying from the
card, or maybe some cards can use bus-mastering DMA -- what checksum?
:-). A few page remaps later, it is in the page-cache having got
through the net subsystem. One more page mapping and it is in the
process which wanted the data.
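
The alignment trick amounts to placing the frame so the headers sit
just before a page boundary, leaving the payload page-aligned. A rough
sketch: the 42-byte header length assumes Ethernet + IP + UDP and is
only an assumption, this is not the real skbuff interface, and freeing
would need the original pointer kept alongside:

    #include <stdlib.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define HDR_LEN   42    /* Ethernet (14) + IP (20) + UDP (8) */

    /* Return a start address such that start + HDR_LEN is page
     * aligned: the card DMAs the frame to the start, and the payload
     * lands exactly on a page boundary, ready for remapping. */
    static char *alloc_rx_buffer(size_t payload_len)
    {
        char *raw = malloc(HDR_LEN + payload_len + PAGE_SIZE);
        uintptr_t payload;

        if (!raw)
            return NULL;
        payload = ((uintptr_t)raw + HDR_LEN + PAGE_SIZE - 1)
                  & ~(uintptr_t)(PAGE_SIZE - 1);
        return (char *)(payload - HDR_LEN);
    }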

Writing is similar.

Just some ideas for the common good,
For post 2.0, I guess,
Enjoy,

-- Jamie Lokier