Re: __commit_write() with the Page Cache

From: Linus Torvalds
Date: Thu May 11 2000 - 19:00:33 EST

On Thu, 11 May 2000, Jeff V. Merkey wrote:
> I don't know why it was done this way, but it appears that unless buffer
> heads are allocated from the buffer cache, submitting your own via
> commit_write() doesn't seem to work. Perhaps you could enlighten me as
> to why buffers aren't getting flushed as they should. After stepping
> through the code, it appears that external buffer heads cannot be used
> with the page cache.

Hmm.. Can you explain more exhaustively what you're trying to do?

You do need to expose your buffer heads in "page->buffers" if you want the
MM layer to know about them. Sadly, right now there is no other way for
the MM layer to really put back-pressure on the filesystem. Otherwise, you
need to have some other mechanism for doing dirty block flushing.

Without using "page->buffers", I don't understand how you expect to keep
track of which pages you already have created the write-out buffers for..

> The obvious and simple solution is to provide a VARIABLE SECTOR LENGTH
> buffer head to ll_rw_block() without the onerous "block size
> 512-1024-2048-4096" restriction that requires TONS of memory to be
> NEEDLESSLY sucked up and used for buffer head chains, when a simple
> field (like b_size) is already present to tell the driver the block
> size. I don't understand why it wasn't implemented this way in the
> first place -- such would seem intuitive to me.

The reason for the fixed size buffers is that the MM layer historically
needed them for memory management - buffers were re-used freely (unlike
now, when the page is actually free'd etc), and a fixed-size-per-page was
the only sane alternative.

I don't think ll_rw_block() really cares, as long as the buffer size is a
multiple of 512 bytes. And the memory management should be able to take
any kind of buffer head list, as long as it's a regular circular list
thing. So it should be possible to just create a page that has a buffer
list that looks like 2048+1024+512+512 bytes, for example.
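
To make the 2048+1024+512+512 example concrete, here is a minimal
user-space sketch of such a list. It is not kernel code: "struct
sketch_bh", "sketch_link()" and "sketch_page_ok()" are hypothetical names
made up for illustration, the 4096-byte page size is an assumption, and
the only rules checked are the ones stated above (each size a multiple of
512 bytes, one regular circular list per page).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, not taken from the thread */
#define SKETCH_SECTOR     512u  /* smallest unit the block layer deals in */

/* Hypothetical stand-in for a buffer head: only what this sketch needs. */
struct sketch_bh {
    unsigned int size;            /* plays the role of b_size */
    struct sketch_bh *this_page;  /* next buffer on the same page (circular) */
};

/* Give n buffers the requested sizes and link them into one ring. */
static void sketch_link(struct sketch_bh *bh, const unsigned int *sizes, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        bh[i].size = sizes[i];
        bh[i].this_page = &bh[(i + 1) % n];
    }
}

/*
 * Walk the ring exactly once, starting anywhere on it.
 * Returns 1 if every buffer is a non-zero multiple of 512 bytes and the
 * sizes add up to exactly one page, 0 otherwise.
 */
static int sketch_page_ok(const struct sketch_bh *head)
{
    unsigned int total = 0;
    const struct sketch_bh *bh = head;

    do {
        if (bh->size == 0 || bh->size % SKETCH_SECTOR != 0)
            return 0;
        total += bh->size;
        if (total > SKETCH_PAGE_SIZE)
            return 0;
        bh = bh->this_page;
    } while (bh != head);

    return total == SKETCH_PAGE_SIZE;
}
```

Linking four buffers with the sizes {2048, 1024, 512, 512} and calling
sketch_page_ok() on any of them should accept the page. Because the list
is circular, the walk does not care which buffer it starts from.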

There might still be some historical remnants that could get confused -
like the buffer locking mechanism that has a per-size lock, but that code
is probably ripe for removal anyway (it used to make sense back when we
did memory management on the buffers, but now pretty much all the memory
management should be page-based, so..)

> I don't know how ugly such a change would be, or how many drivers it
> would bust, but it would certainly make it a lot easier moving forward
> for all kinds of performance optimizations. At present, I've had to
> implement an extra flush daemon to flush the dirty pages with mirroring
> enabled. I would rather have a buffer head that will:
> 1). allow variable length sector writes/reads up to 128 sectors in a
> run.
> 2). allow logical LBA (vs. physical) that will call bmap() prior to
> write and support multiple locations for a single page to be written.
> 3). allow externally allocated buffer heads to be posted via
> commit_write() and actually get flushed.

(1) can't be done. You get a variable-length buffer_head array right now,
and that's what you get. One buffer-head can point only to one contiguous
area, a multiple of 512 bytes in size, and less than or equal to 1 page in
size for MM reasons (or rather, not one page, but one "page cache entry",
which at this point is the same as one page).
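
As a rough illustration of what that limit means for a long run, the
sketch below cuts a run of sectors into pieces that one buffer head each
could describe. It is user-space C, not kernel code: "struct run_piece"
and "run_split()" are hypothetical names, the page size is passed in as
an assumption, and the data is assumed to start page-aligned in memory.

```c
#include <assert.h>
#include <stddef.h>

#define RUN_SECTOR 512u  /* sector size in bytes */

/* One contiguous piece that a single buffer head could describe. */
struct run_piece {
    unsigned long start_sector;  /* first sector of the piece */
    unsigned int bytes;          /* multiple of 512, at most one page */
};

/*
 * Split nr_sectors sectors starting at 'start' into pieces no larger than
 * page_size bytes. At most 'max' pieces are written to 'out'; the return
 * value is the number of pieces the run needs, which may exceed 'max'.
 */
static size_t run_split(unsigned long start, unsigned int nr_sectors,
                        unsigned int page_size,
                        struct run_piece *out, size_t max)
{
    assert(page_size >= RUN_SECTOR && page_size % RUN_SECTOR == 0);

    unsigned int per_page = page_size / RUN_SECTOR;
    size_t n = 0;

    while (nr_sectors > 0) {
        unsigned int take = nr_sectors < per_page ? nr_sectors : per_page;

        if (n < max) {
            out[n].start_sector = start;
            out[n].bytes = take * RUN_SECTOR;
        }
        n++;
        start += take;
        nr_sectors -= take;
    }
    return n;
}
```

With an assumed 4096-byte page, the 128-sector run from the request
above comes out as 16 page-sized pieces, so it needs an array of 16
buffer heads rather than one large one.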

(2) is what the filesystem is supposed to do. No "bmap" callback, because
the filesystem should just look up the right bh's itself. That's basically
what ext2 does right now, except it does it with this helper routine
called "block_prepare_write()".

(3) should in theory work, you just need to fill in page->buffers for it,
and right now that may or may not do what you want. See the 99-pre7 code
to sync the dang thing for you - it may be more along what you want done.
And right now there is no way to "plug" the write - once page->buffers has
been filled in, the write-back can happen at any time.



This archive was generated by hypermail 2b29 : Mon May 15 2000 - 21:00:19 EST