[RFC] On paging of kernel VM.

From: David Woodhouse (dwmw2@infradead.org)
Date: Mon Sep 09 2002 - 04:20:53 EST


I think I'd like to introduce 'real' VMAs into kernel space, so that areas
in the vmalloc range can have 'real' vm_ops and more to the point a real
nopage function.

Unfortunately, AFAICT this would involve changing the fault handler on every
platform, so I'm debating whether it's really worth it -- whether anyone else
could use it, and whether I could get round my problem any other way.

The problem is flash chips. These basically behave as ROM, but you write to
them by writing magic values to magic addresses, and during a write
operation the _whole_ chip returns status bits instead of data.

To avoid taking up precious RAM with copies of data which are already in
flash, we can map pages of flash directly into userspace. On taking a
fault, we wait for any pending write to complete, mark the chip as busy,
then set up the page tables appropriately so that userspace can read from
it. On starting a write operation, we invalidate all currently-visible
pages before starting to talk to the chip.
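
A rough sketch of that protocol, as a plain C model rather than real MTD
or VM code. Everything here (struct flash_chip, flash_fault(),
flash_start_write(), NPAGES) is a made-up name for illustration, the
mapped[] array stands in for real page table entries, and
wait_for_write() pretends the chip finishes immediately where a real
driver would sleep:

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8                /* hypothetical: flash pages exposed to readers */

enum chip_state { CHIP_IDLE, CHIP_READING, CHIP_WRITING };

struct flash_chip {
    enum chip_state state;
    bool mapped[NPAGES];        /* stand-in for "a pte exists for this page" */
};

/* Stand-in for sleeping until the chip signals that the write is done. */
static void wait_for_write(struct flash_chip *c)
{
    if (c->state == CHIP_WRITING)
        c->state = CHIP_IDLE;
}

/* Fault path: a reader touched a flash page that is not currently mapped. */
static bool flash_fault(struct flash_chip *c, size_t page)
{
    if (page >= NPAGES)
        return false;           /* not one of ours */
    wait_for_write(c);          /* 1. wait for any pending write */
    c->state = CHIP_READING;    /* 2. mark the chip as busy */
    c->mapped[page] = true;     /* 3. set up the page table entry */
    return true;
}

/* Write path: readers must never see status bits, so unmap everything
 * before starting to talk to the chip. */
static void flash_start_write(struct flash_chip *c)
{
    for (size_t i = 0; i < NPAGES; i++)
        c->mapped[i] = false;
    c->state = CHIP_WRITING;
}

/* Number of pages currently visible to readers. */
static size_t flash_mapped_count(const struct flash_chip *c)
{
    size_t n = 0;

    for (size_t i = 0; i < NPAGES; i++)
        n += c->mapped[i];
    return n;
}
```

In this model, a flash_fault() that arrives after flash_start_write()
first waits for the write and then maps only the page that was touched,
so every other page faults again on its next access.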

There are cases in the kernel where we'd really like the same setup --
mounting a JFFS2 file system, for example, is a slow operation because it's
entirely log-structured and we have to read every log entry on the file
system. The current method of reading into a RAM buffer under a lock and
then dealing with stuff in RAM is entirely suboptimal, and proof-of-concept
hacks to just use a pointer into the flash chip have been observed to
improve mount time by about a factor of 4.

The locking is a problem though. Flash chips may be divided into multiple
partitions and other code may want to write to its partition while a mount
is in progress. The naïve approach of just locking the chip into read mode
on giving out a pointer to it, and unlocking it when the mount is complete,
is going to suck royally. Hence, it would be very nice if we could play the
same trick as we do for userspace: giving out a pointer which is always
going to be valid; you just might have to wait for it.

But as I said, this means screwing with every fault handler. It doesn't
have to affect the fast path -- we can go looking for these vmas only in
the case where we've already tried looking for the appropriate pte in
init_mm and haven't found it. But it's still an intrusive change that would
need to be done on every architecture.
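
To illustrate the ordering I have in mind, here is a small C model of
the decision, not a patch against any real fault handler. struct kvma,
kernel_vmalloc_fault(), the result codes and the demo_* stubs are all
invented for this sketch; the init_mm lookup is passed in as a callback
because the real check is architecture-specific:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel-space VMA with a nopage-style callback. */
struct kvma {
    unsigned long start, end;               /* covers [start, end) */
    bool (*nopage)(unsigned long addr);     /* true if a page was installed */
};

enum fault_result { FAULT_SYNCED_PTE, FAULT_HANDLED_NOPAGE, FAULT_OOPS };

typedef bool (*pte_lookup_fn)(unsigned long addr);

static enum fault_result kernel_vmalloc_fault(unsigned long addr,
                                              pte_lookup_fn init_mm_has_pte,
                                              const struct kvma *vmas,
                                              size_t nr_vmas)
{
    /* Existing path, untouched: the pte is already there in init_mm. */
    if (init_mm_has_pte(addr))
        return FAULT_SYNCED_PTE;

    /* New slow path: only reached once the init_mm lookup has failed. */
    for (size_t i = 0; i < nr_vmas; i++) {
        const struct kvma *v = &vmas[i];

        if (addr >= v->start && addr < v->end && v->nopage)
            return v->nopage(addr) ? FAULT_HANDLED_NOPAGE : FAULT_OOPS;
    }
    return FAULT_OOPS;                      /* genuinely bad access */
}

/* Demo stubs, purely for exercising the sketch. */
static int demo_nopage_calls;

static bool demo_has_pte(unsigned long addr)
{
    return addr < 0x1000;                   /* pretend low addresses are mapped */
}

static bool demo_nopage(unsigned long addr)
{
    (void)addr;
    demo_nopage_calls++;
    return true;
}
```

With a single kvma covering 0x2000-0x3000, an address below 0x1000
never reaches the VMA walk at all, which is the property that keeps the
common case as cheap as it is today.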

I'm wondering what else could use this if it were implemented. Is there any
need for something like vmalloc_pageable(), for example? Anything else?
Rusty and I have wittered about marking certain kernel functions and data as
__pageable to go into a special such section too, but I'm wondering if that
conversation was slightly Guinness-influenced :)

Or is there another way to solve my original problem that I've overlooked?

Answers on a postcard to...

--
dwmw2
