Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

From: Linus Torvalds (
Date: Tue Feb 06 2001 - 15:59:02 EST

On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> The second is that bh's are two things:
> - a cacheing object
> - an io buffer

Actually, they really aren't.

They kind of _used_ to be, but more and more they've moved away from that
historical use. Check in particular the page cache, and as a really
extreme case the swap cache version of the page cache.

It certainly _used_ to be true that "bh"s were actually first-class memory
management citizens, and actually had a data buffer and a cache associated
with them. And because of that historical baggage, that's how many people
still think of them.

These days, it's really not true any more. A "bh" doesn't really have an
IO buffer intrisically associated with it any more - all memory management
is done on a _page_ level, and it really works the other way around, ie a
page can have one or more bh's associated with it as the IO entity.

This _does_ show up in the bh itself: you find that bh's end up having the
bh->b_page pointer in it, which is really a layering violation these days,
but you'll notice that it's actually not used very much, and it could
probably be largely removed.

The most fundamental use of it (from an IO standpoint) is actually to
handle high memory issues, because high-memory handling is very
fundamentally based on "struct page", and in order to be able to have
high-memory IO buffers you absolutely have to have the "struct page" the
way things are done now.

(all the other uses tend to not be IO-related at all: they are stuff like
the callbacks that want to find the page that should be free'd up)

The other part of "struct bh" is that it _does_ have support for fast
lookups, and the bh hashing. Again, from a pure IO standpoint you can
easily choose to just ignore this. It's often not used at all (in fact,
_most_ bh's aren't hashed, because the only way to find them are through
the page cache).

> This is not really an clean appropeach, and I would really like to
> get away from it.

Trust me, you really _can_ get away from it. It's not designed into the
bh's at all. You can already just allocate a single (or multiple) "struct
buffer_head" and just use them as IO objects, and give them your _own_
pointers to the IO buffer etc.

In fact, if you look at how the page cache is organized, this is what the
page cache already does. The page cache has it's own IO buffer (the page
itself), and it just uses "struct buffer_head" to allocate temporary IO
entities. It _also_ uses the "struct buffer_head" to cache the meta-data
in the sense of having the buffer head also contain the physical address
on disk so that the page cache doesn't have to ask the low-level
filesystem all the time, so in that sense it actually has a double use for

But you can (and _should_) think of that as a "we got the meta-data
address caching for free, and it fit with our historical use, so why not
use it?".

So you can easily do the equivalent of

 - maintain your own buffers (possibly by looking up pages directly from
   user space, if you want to do zero-copy kind of things)

 - allocate a private buffer head ("get_unused_buffer_head()")

 - make that buffer head point into your buffer

 - submit the IO by just calling "submit_bh()", using the b_end_io()
   callback as your way to maintain _your_ IO buffer ownership.

In particular, think of the things that you do NOT have to do:

 - you do NOT have to allocate a bh-private buffer. Just point the bh at
   your own buffer.
 - you do NOT have to "give" your buffer to the bh. You do, of course,

   want to know when the bh is done with _your_ buffer, but that's what
   the b_end_io callback is all about.

 - you do NOT have to hash the bh you allocated and thus expose it to
   anybody else. It is YOUR private bh, and it does not show up on ANY
   other lists. There are various helper functions to insert the bh on
   various global lists ("mark_bh_dirty()" to put it on the dirty list,
   "buffer_insert_inode_queue()" to put it on the inode lists etc, but
   there is nothing in the thing that _forces_ you to expose your bh.

So don't think of "bh->b_data" as being something that the bh owns. It's
just a pointer. Think of "bh->b_data" and "bh->b_size" as _nothing_ more
than a data range in memory.

In short, you can, and often should, think of "struct buffer_head" as
nothing but an IO entity. It has some support for being more than that,
but that's secondary. That can validly be seen as another layer, that is
just so common that there is little point in splitting it up (and a lot of
purely historical reasons for not splitting it).


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to
Please read the FAQ at

This archive was generated by hypermail 2b29 : Wed Feb 07 2001 - 21:00:25 EST