Re: mmap/write vs read/write surprise

Mark Hemment (markhe@nextd.demon.co.uk)
10 Jul 1997 23:47:30 +0200


Hi,

On Thu, 10 Jul 1997, Jim Nance wrote:
> At least with my test program, it seems like using the mmap/write
> method shows the best performance gain when the file is small (ie 400 bytes).
> As the file size grows, the performance gain decreases, and when the
> file size gets close to 1M, it becomes faster to use read/write rather
> than mmap.

For sequential access to the file, I would expect the read() (file
I/O) method to be faster. This is because the kernel performs much more
read-ahead for files accessed by this method than for mmap(); for mmap(),
only one page beyond the faulting address is read ahead.
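
For reference, a minimal userspace sketch of the two access methods
being compared (this is only an illustration using standard POSIX calls,
not Jim's actual test program):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static unsigned long sum_via_read(int fd)
{
    static unsigned char buf[65536];
    unsigned long sum = 0;
    ssize_t n, i;

    lseek(fd, 0, SEEK_SET);
    while ((n = read(fd, buf, sizeof(buf))) > 0)  /* kernel read-ahead helps here */
        for (i = 0; i < n; i++)
            sum += buf[i];
    return sum;
}

static unsigned long sum_via_mmap(int fd, size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    unsigned long sum = 0;
    size_t i;

    if (p == MAP_FAILED)
        return 0;
    for (i = 0; i < len; i++)    /* each new page costs a fault */
        sum += p[i];
    munmap(p, len);
    return sum;
}

int main(int argc, char **argv)
{
    struct stat st;
    int fd;

    if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;
    fstat(fd, &st);
    printf("read: %lu  mmap: %lu\n",
           sum_via_read(fd), sum_via_mmap(fd, st.st_size));
    close(fd);
    return 0;
}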

It is possible to implement page-fault prediction per vm-area, with the
kernel reading further ahead as it becomes more confident of the faulting
pattern (a sketch follows the list below):
o When a fault occurs, the faulting address is stored in the
vm-area structure.
o If the faulting address is the one expected, then increase
the read-ahead distance (or the read-behind distance, if the file
is being accessed backwards), and start I/O on the predicted pages
if they are not already incore (or I/O-locked, which indicates they
are "on their way"). Based upon this success, calculate the next
faulting address.
o If the faulting address is not the one expected, then decrease
(throttle back) the read-ahead/behind distance, or perhaps even
reverse the predicted direction.
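
As a rough sketch (a userspace toy; every name here is invented for
illustration, none of it is existing kernel code), the per-vm-area
state and the update on each fault might look like:

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define RA_MIN    1u
#define RA_MAX    32u

/* Hypothetical prediction state (would live in the vm-area structure). */
struct fault_predict {
    unsigned long next_addr;   /* predicted next faulting address */
    long step;                 /* +PAGE_SIZE forward, -PAGE_SIZE backward */
    unsigned int ra_pages;     /* current read-ahead/behind distance */
};

/* Stub standing in for "start I/O on any of these pages which are
 * not already incore or I/O-locked". */
static void start_readin(unsigned long addr, unsigned int npages, long step)
{
    printf("  async read %u page(s) %s from %#lx\n",
           npages, step > 0 ? "ahead" : "behind", addr);
}

static void fault_notify(struct fault_predict *p, unsigned long addr)
{
    if (addr == p->next_addr) {
        /* Hit: widen the window (capped) and start I/O on the
         * predicted pages. */
        if (p->ra_pages < RA_MAX)
            p->ra_pages <<= 1;
        start_readin(addr + p->step, p->ra_pages, p->step);
    } else {
        /* Miss: throttle back; if the task stepped the other way
         * from the last fault, reverse the predicted direction. */
        if (p->ra_pages > RA_MIN)
            p->ra_pages >>= 1;
        if (addr == p->next_addr - 2 * p->step)
            p->step = -p->step;
    }
    p->next_addr = addr + p->step;
}

int main(void)
{
    struct fault_predict p = { 0, (long)PAGE_SIZE, RA_MIN };
    unsigned long a;

    for (a = 0; a < 6 * PAGE_SIZE; a += PAGE_SIZE)  /* sequential scan */
        fault_notify(&p, a);
    return 0;
}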

If the mmap()ed file has no (determinable) access pattern, then the
read-ahead/behind will not kick in.
(Note: Because of VM_CLONE the faulting stats are not really per vm_area,
but per reference to a vm_area - nasty!).

With the current design of the page-cache, this has a small problem.
Unmapped pages (that is, pages which are not part of any user
address-space) are not 'aged' in the way (currently) mapped pages are.
Their only defence against being reaped is the 'PG_referenced' bit. This
means pages read in on the hope they will be needed soon are quickly
shredded if memory becomes low. (This, of course, also happens with
traditional file I/O pages.)
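
Roughly the behaviour I mean, as a toy model (not the real
shrink_mmap() code):

#include <stdio.h>

/* Toy model of the single-bit defence: an unmapped page-cache page
 * survives a sweep only if PG_referenced was set since the previous
 * sweep.  A page read in speculatively has never been touched, so it
 * dies on the first sweep under memory pressure. */
struct pg {
    int present;
    int referenced;    /* stands in for PG_referenced */
};

static int reap_sweep(struct pg *pages, int n)
{
    int i, freed = 0;

    for (i = 0; i < n; i++) {
        if (!pages[i].present)
            continue;
        if (pages[i].referenced) {
            pages[i].referenced = 0;    /* one more chance */
        } else {
            pages[i].present = 0;       /* reaped */
            freed++;
        }
    }
    return freed;
}

int main(void)
{
    /* Page 0 was touched; pages 1-3 are untouched read-ahead pages. */
    struct pg pages[4] = { {1, 1}, {1, 0}, {1, 0}, {1, 0} };

    printf("first sweep frees %d page(s)\n", reap_sweep(pages, 4));
    return 0;
}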
To compound the reaping problem, more free pages are needed for the
read-ahead itself. A partial solution here is to add another allocation
priority that does not try very hard to find a free page. (In fact, the
priority should decay as the distance from the original faulting address
increases.)
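
Something like this for the decay (the scale is invented; 0 would try
hardest, higher numbers give up sooner):

#include <stdio.h>

static unsigned int ra_alloc_priority(unsigned int dist_pages)
{
    if (dist_pages <= 1)
        return 1;   /* almost as keen as a demand fault */
    if (dist_pages <= 4)
        return 2;
    return 3;       /* only take an already-free page, else give up */
}

int main(void)
{
    unsigned int d;

    for (d = 1; d <= 8; d <<= 1)
        printf("distance %u -> priority %u\n", d, ra_alloc_priority(d));
    return 0;
}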

It is possible to get very crafty. If the access pattern of a mmap()ed
file can be determined, then it is possible to fill-ahead the PTEs. That
is, on a page-fault where the vm_area is being accessed sequentially and
the next (predicted to fault) pages are already incore, it is possible
to map them into the faulting task's address-space. This avoids later
page-faults by handling them all in a single 'chunk'.
Of course, this does have problems (such as: we shouldn't really cross
page-table boundaries when doing this, and it changes the weights needed
for page-reaping - there are fewer unmapped pages in the page-cache for
shrink_mmap() to reap, so the kernel becomes more dependent on
try_to_swap_out(), which has a poor ratio of success to CPU cycles; it
can also mess up scheduling slightly). A sketch of the fill-ahead loop
follows.
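
Again a toy userspace model, with the page-table boundary check; every
name and constant here is invented, not kernel code:

#include <stdio.h>

#define PAGE_SIZE    4096UL
#define PTRS_PER_PTE 1024UL
#define PT_SPAN      (PAGE_SIZE * PTRS_PER_PTE)  /* one page table's reach */

/* Stubs standing in for "is the page already in the page-cache?" and
 * "install a PTE for it" -- purely illustrative. */
static int page_incore(unsigned long addr)
{
    return addr < 6 * PAGE_SIZE;    /* pretend pages 0-5 are incore */
}

static void map_page(unsigned long addr)
{
    printf("  filled PTE for %#lx\n", addr);
}

/* After handling the fault at 'addr', also map up to 'npages'
 * predicted pages that are already incore, stopping at the end of the
 * current page table so we never cross a page-table boundary. */
static void fill_ahead(unsigned long addr, unsigned int npages)
{
    unsigned long pt_end = (addr & ~(PT_SPAN - 1)) + PT_SPAN;
    unsigned long a = addr + PAGE_SIZE;

    while (npages-- && a < pt_end && page_incore(a)) {
        map_page(a);
        a += PAGE_SIZE;
    }
}

int main(void)
{
    fill_ahead(2 * PAGE_SIZE, 8);   /* fault at page 2, try 8 ahead */
    return 0;
}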

It's not that difficult to implement - just a bugger to tune...

Regards,

markhe

------------------------------------------------------------------
Mark Hemment, Unix/C Software Engineer (Contractor)
markhe@nextd.demon.co.uk http://www.nextd.demon.co.uk/
"Success has many fathers, failure is a B**TARD!" - anon
------------------------------------------------------------------