Re: __commit_write() with the Page Cache

From: Jeff V. Merkey (jmerkey@timpanogas.com)
Date: Thu May 11 2000 - 19:16:42 EST


Linus Torvalds wrote:
>
>
>
> Hmm.. Can you explain more exhaustively what you're trying to do?

I'm plugging NetWare mirroring into the page cache directly, so the
page cache events writepage(), prepare_write(), and commit_write() will
post simultaneous asynchronous write requests across several devices.
The writes don't have to complete synchronously -- mirrored writes can
be flushed lazily -- so what you've got there should work OK, provided
I can get the flushing to work correctly.
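
Roughly, the commit_write() side looks like the sketch below.
nwfs_post_mirror_writes() is a made-up name for the routine that queues
the same buffer to each mirror member, and the exact aops prototypes
moved around a bit in the 2.3.99 pre-releases, so take the signatures
as illustrative:

#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical helper: queues one asynchronous write of bh's data to
 * every device in the mirror group. */
extern void nwfs_post_mirror_writes(struct buffer_head *bh);

static int nwfs_commit_write(struct file *file, struct page *page,
                             unsigned from, unsigned to)
{
        struct buffer_head *bh, *head = page->buffers;

        /* Walk the chain block_prepare_write() hung off the page and
         * post the mirrored writes; nothing here waits for completion.
         * A real version would only touch the buffers covering the
         * [from, to) range. */
        bh = head;
        do {
                set_bit(BH_Uptodate, &bh->b_state);
                set_bit(BH_Dirty, &bh->b_state);
                nwfs_post_mirror_writes(bh);
                bh = bh->b_this_page;
        } while (bh != head);

        return 0;
}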

>
> You do need to expose your buffer heads in "page->buffers" if you want the
> MM layer to know about them. Sadly, right now there is no other way for
> the MM layer to really put back-pressure on the filesystem. Otherwise, you
> need to have some other mechanism for doing dirty block flushing.

I do expose them via page->buffers -- this is filled in via the call to
block_prepare_write().
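
For reference, the prepare_write() side is just a thin wrapper around
your helper (nwfs_get_block() stands in for my block-mapping routine;
the name is hypothetical here):

#include <linux/fs.h>

/* Filesystem-specific mapping from a logical file block to an on-disk
 * block; the name is a placeholder. */
extern int nwfs_get_block(struct inode *inode, long block,
                          struct buffer_head *bh_result, int create);

static int nwfs_prepare_write(struct file *file, struct page *page,
                              unsigned from, unsigned to)
{
        /* block_prepare_write() creates the buffer_head chain and
         * attaches it to page->buffers, which is what lets the MM
         * layer see and flush the buffers. */
        return block_prepare_write(page, from, to, nwfs_get_block);
}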

>
> Without using "page->buffers", I don't understand how you expect to keep
> track of which pages you already have created the write-out buffers for..

I am doing exactly this.

>
> > The obvious and simple solution is to provide a VARIABLE SECTOR LENGTH
> > buffer head to ll_rw_block() without the onerous "block size
> > 512-1024-2048-4096" restriction that requires TONS of memory to be
> > NEEDLESSLY sucked up and used for buffer head chains, when a simple
> > field (like b_size) is already present to tell the driver the block
> > size. I don't understand why it wasn't implemented this way in the
> > first place -- such would seem intuitive to me.
>
> The reason for the fixed size buffers is that the MM layer historically
> needed them for memory management - buffers were re-used freely (unlike
> now, when the page is actually free'd etc), and a fixed-size-per-page was
> the only sane alternative.
>
> I don't think ll_rw_block() really cares, as long as the buffer size is a
> multiple of 512 bytes. And the memory management should be able to take
> any kind of buffer head list, as long as it's a regular circular list
> thing. So it should be possible to just create a page that has a buffer
> list that looks like 2048+1024+512+512 bytes, for example.

This makes sense -- it would allow a "scatter gather" semantic, letting
memory from several locations be combined into a single disk write. If
I understand you correctly, then I could conceivably create a buffer
head chain of a very large size?
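
In other words, something like the following could hang your
2048+1024+512+512 example off a single page (pure illustration: buffer
heads would really come from the kernel's own allocator rather than
kmalloc(), blocknr_for() is a made-up stand-in for the filesystem's
block mapping, and error handling is omitted):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/wait.h>

/* Hypothetical: maps a byte offset within the page to a device block
 * number.  Note that b_blocknr is in units of b_size. */
extern unsigned long blocknr_for(struct page *page, unsigned long offset,
                                 int size);

static void attach_mixed_chain(struct page *page, kdev_t dev)
{
        int sizes[4] = { 2048, 1024, 512, 512 };
        struct buffer_head *bh[4];
        unsigned long offset = 0;
        int i;

        for (i = 0; i < 4; i++) {
                bh[i] = kmalloc(sizeof(*bh[i]), GFP_KERNEL);
                memset(bh[i], 0, sizeof(*bh[i]));
                init_waitqueue_head(&bh[i]->b_wait);
                bh[i]->b_page    = page;
                bh[i]->b_data    = (char *) page_address(page) + offset;
                bh[i]->b_size    = sizes[i];
                bh[i]->b_dev     = dev;
                bh[i]->b_blocknr = blocknr_for(page, offset, sizes[i]);
                bh[i]->b_state   = (1 << BH_Uptodate) | (1 << BH_Dirty) |
                                   (1 << BH_Mapped);
                offset += sizes[i];
        }

        /* The MM layer expects a circular list linked via b_this_page. */
        for (i = 0; i < 4; i++)
                bh[i]->b_this_page = bh[(i + 1) % 4];
        page->buffers = bh[0];

        /* Queue all four pieces as one scatter-gather style write. */
        ll_rw_block(WRITE, 4, bh);
}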

>
> There might still be some historical remnants that could get confused -
> like the buffer locking mechanism that has a per-size lock, but that code
> is probably ripe for removal anyway (it used to make sense back when we
> did memory management on the buffers, but now pretty much all the memory
> management should be page-based, so..)

I'll implement it and see what gets busted -- I hope the kernel mailing
list archive has a big hard disk to handle all the emails from the mad
driver writers we piss off.

>
> (1) can't be done. You get a variable-length buffer_head array right now,
> and that's what you get. One buffer-head can point only to one contiguous
> area, a multiple of 512 bytes in size, and less than or equal to 1 page in
> size for MM reasons (or rather, not one page, but one "page cache entry",
> which at this point is the same as one page).

To support the scatter-gather semantic referenced above, this is an
acceptable tradeoff. I did not know that buffer heads could be chained
this freely -- some of the drivers I've reviewed are going to break when
I do this, because they seem to assume that a buffer head list won't
reference more than one contiguous page of memory. See the sketch below
for the pattern those drivers ought to follow.
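
A request loop that honours each buffer head's own b_data and b_size
copes with both mixed sizes and non-contiguous memory -- roughly this
shape (do_transfer() is a made-up stand-in for the actual hardware
I/O):

#include <linux/fs.h>
#include <linux/blkdev.h>

/* Hypothetical: moves b_size bytes between the device and b_data,
 * starting at the given sector. */
extern void do_transfer(int cmd, unsigned long sector, char *data,
                        unsigned int size);

static void handle_request(struct request *req)
{
        struct buffer_head *bh;
        unsigned long sector = req->sector;

        /* Each buffer head in the request carries its own address and
         * length; nothing guarantees they share one contiguous page. */
        for (bh = req->bh; bh != NULL; bh = bh->b_reqnext) {
                do_transfer(req->cmd, sector, bh->b_data, bh->b_size);
                sector += bh->b_size >> 9;
        }
}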

>
> (2) is what the filesystem is supposed to do. No "bmap" callback, because
> the filesystem should just look up the right bh's itself. That's basically
> what ext2 does right now, except it does it with this helper routine
> called "block_prepare_write()".

It would be nice if your page cache would support what I've done in the
NWFS LRU structure, i.e. a single logical file offset (which the page
cache provides) that can map to multiple devices. At present, I allow
up to eight concurrent AIO requests per logical block so I can mirror
across several devices without double, triple, or quadruple buffering
of the data. The idea is basically to implement up to eight
page->buffers pointers in the page structure (page->buffer1,
page->buffer2, etc.) and put the smarts in the buffer cache to flush
all of the buffer head chains asynchronously, signalling
wait_on_buffer() waiters when all the writes complete. I know how the
md drivers work; unfortunately, since the buffer cache is physical
rather than logical, it gives NetWare mirroring fits, because a single
logical volume block can map to entirely different physical LBA offsets
across disks (NetWare partitions will not all sit at the same physical
offsets across devices). If this capability could be implemented, I'll
throw away the NWFS LRU and use yours.
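
Something along the lines of the sketch below is what I have in mind
(pure sketch, all the nw_* names are made up, and it assumes the
2.4-style submit_bh() so the per-buffer end_io callback doesn't get
overwritten the way ll_rw_block() would do it):

#include <linux/fs.h>
#include <linux/locks.h>
#include <linux/wait.h>
#include <asm/atomic.h>

#define NW_MAX_MIRRORS 8

struct nw_mirror_set {
        struct page        *page;
        struct buffer_head *mirror_bh[NW_MAX_MIRRORS]; /* one bh per device;
                                                          real code would
                                                          walk whole chains */
        int                 nr_mirrors;
        atomic_t            writes_pending;
        wait_queue_head_t   wait;
};

static void nw_mirror_end_io(struct buffer_head *bh, int uptodate)
{
        struct nw_mirror_set *set = bh->b_private;

        if (uptodate)
                set_bit(BH_Uptodate, &bh->b_state);

        /* unlock_buffer() wakes any wait_on_buffer() sleepers on this
         * particular buffer head. */
        unlock_buffer(bh);

        /* Only the completion of the last mirror write wakes whoever
         * is waiting for the whole set. */
        if (atomic_dec_and_test(&set->writes_pending))
                wake_up(&set->wait);
}

static void nw_mirror_flush(struct nw_mirror_set *set)
{
        int i;

        atomic_set(&set->writes_pending, set->nr_mirrors);
        for (i = 0; i < set->nr_mirrors; i++) {
                struct buffer_head *bh = set->mirror_bh[i];

                bh->b_end_io  = nw_mirror_end_io;
                bh->b_private = set;
                lock_buffer(bh);
                submit_bh(WRITE, bh);   /* asynchronous; returns at once */
        }
}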

>
> (3) should in theory work, you just need to fill in page->buffers for it,
> and right now that may or may not do what you want. See the 99-pre7 code
> to sync the dang thing for you - it may be more along what you want done.
> And right now there is no way to "plug" the write - once page->buffers has
> been filled in, the write-back can happen at any time.

Who does the write-back? It looks like it's the buffer cache. So pre7
fixes this flushing problem?

Jeff
>
> Linus
