Re: mmap on a device returns ENODEV

Stephen C. Tweedie (sct@redhat.com)
Fri, 10 Dec 1999 13:43:22 +0000 (GMT)

Next message: Jeffrey B. Siegal: "Re: Windows 9x and RFC1323"
Previous message: Ingo Molnar: "Re: mmap on a device returns ENODEV"

Hi,

On Fri, 10 Dec 1999 07:28:48 -0500 (EST), Ingo Molnar
<mingo@redhat.com> said:

> On Fri, 10 Dec 1999, Stephen C. Tweedie wrote:

>> As soon as we have <4k-blocksize buffers pinned in this way in the
>> page cache, we cannot freely relocate those buffers to be page-aligned
>> with respect to their physical location.

...

> for christ's sake, this is not a RAID5 problem at all. Yes, RAID5 wants to
> snoop clean block cache state (even if in the pagecache), but it does in
> no way interfer in the typical case.

That's not what I'm talking about at all: I agree, completely, that we
can make the raid stuff work just fine by avoiding buffers which
look as if they are in a transient state.

It's not about the IO itself: it's about the cache. Ingo, I'm
actually worrying about exactly the issues you yourself brought up a
while ago, and which I agree are important: how can raid-5 snoop the
buffer cache if the buffers are in fact in the page cache?

For file data, our buffer_heads are currently not hashed, but in older
versions of the 2.3 cache stuff they used to be. Didn't you want to
see us add that hash back as a way to allow raid-5 to look up that
cached data?

*That* is the problem. I agree with you that doing the hash is going
to be good for raid-5 performance. However, doing so is going to lock
down hashed buffers in memory in an alignment which does not match the
physical layout of the buffers. You cannot have a buffer hashed
twice: once you have that buffer hashed in the page cache based on
virtual offset, you cannot then hash it into the page cache again
based on physical blocknr.
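
A minimal sketch of that constraint, in plain user-space C rather than
kernel code. Everything here is hypothetical (toy_buffer, toy_hash_insert,
toy_hash_lookup and the key kinds are illustration names, not the
kernel's buffer_head or its hash functions). It assumes only what the
paragraph above says: a buffer carries a single hash link, so it can sit
on one chain under one key at a time.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_HASH_SIZE 64u

/* Which index, if any, currently holds the buffer. */
enum toy_key_kind { TOY_KEY_NONE, TOY_KEY_VIRTUAL, TOY_KEY_PHYSICAL };

/* Hypothetical stand-in for a buffer descriptor; not the kernel's buffer_head. */
struct toy_buffer {
    enum toy_key_kind kind;   /* key space the buffer is hashed under */
    unsigned long owner;      /* inode number (virtual) or device number (physical) */
    unsigned long index;      /* file offset in blocks, or disk block number */
    struct toy_buffer *next;  /* the only chain link the buffer has */
};

static struct toy_buffer *toy_table[TOY_HASH_SIZE];

static unsigned long toy_bucket(enum toy_key_kind kind,
                                unsigned long owner, unsigned long index)
{
    return (owner * 31u + index * 17u + (unsigned long)kind) % TOY_HASH_SIZE;
}

/* Returns 0 on success, -1 if the buffer is already hashed under some key. */
static int toy_hash_insert(struct toy_buffer *b, enum toy_key_kind kind,
                           unsigned long owner, unsigned long index)
{
    unsigned long slot;

    assert(kind != TOY_KEY_NONE);
    if (b->kind != TOY_KEY_NONE)
        return -1;            /* one link, so a second chain is impossible */

    slot = toy_bucket(kind, owner, index);
    b->kind = kind;
    b->owner = owner;
    b->index = index;
    b->next = toy_table[slot];
    toy_table[slot] = b;
    return 0;
}

/* Finds a buffer only under the key it was inserted with. */
static struct toy_buffer *toy_hash_lookup(enum toy_key_kind kind,
                                          unsigned long owner,
                                          unsigned long index)
{
    struct toy_buffer *b = toy_table[toy_bucket(kind, owner, index)];

    for (; b != NULL; b = b->next)
        if (b->kind == kind && b->owner == owner && b->index == index)
            return b;
    return NULL;
}
```

In this toy model, a buffer inserted under a (inode, offset) key is
invisible to a lookup by (device, blocknr), and a second insert under
the physical key is refused.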

Calm down, please, I'm not trying to suggest for a moment that the
raid code and the swap/journaling code can't be made to agree
consistent semantics. That is obviously easy to do.

The question is whether all mappings of buffers should be hashed.
The advantage of doing so is that raid-5 can find data
in the caches more easily: the disadvantage is that it makes decent
semantics for page-cached block devices that much harder.

The kiobuf issue is a related one: if we end up developing a much more
streamlined API than ll_rw_block, which allows us to pass extents
directly into the request layer, then IO requests will no longer be
made in units of buffer_heads, and it becomes even harder for raid-5
to successfully snoop the page cache. In such a world, the page cache
would quite conceivably be aliased with kiobufs, not with buffer_heads.

> i dont know Stephen what your problem is. I thought we agreed on
> that by defining clean 'dont touch' semantics in the buffer-cache
> the RAID code and journalling code can do it's thing just fine.

Absolutely, there's no problem there. From raid-5's point of view,
the hashing of buffer_heads is at the minute mainly a performance
question --- the correctness issue is easily fixed --- but it is a
_serious_ performance question.

However, we do need to consider the layering: the question of how raid
might do cache snooping is closely tied to the issue about how we
organise the block IO layering, because as soon as we introduce new
APIs to the request layers, we will have to reconsider this whole
snooping question all over again.

So forget about the issue of how you snoop the buffer cache correctly:
we can fix that, it's not a serious problem. The question is how
raid-5 will snoop data in general, given that current 2.3 filesystems
don't hash their buffers and future filesystems may not even use
buffer_heads in their caching; and given that the page-alignment of
buffer heads for block device IO will be incompatible with the hashing
of all page-cached buffer_heads (we'd have buffer_heads for the same
disk block hanging off both the filesystem's inode and the block
device inode).
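
The alignment mismatch can be illustrated with some simple arithmetic.
This is only a sketch under stated assumptions: a 4096-byte page, a
block size that divides it evenly, and a block-device cache that places
disk block N in page N / (blocks per page), slot N % (blocks per page).
The names toy_slot, toy_device_slot and toy_file_page_is_device_aligned
are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size */

/* Position of a disk block inside a physically indexed page cache. */
struct toy_slot {
    unsigned long page;       /* which device page holds the block */
    unsigned int slot;        /* position of the block within that page */
};

static struct toy_slot toy_device_slot(unsigned long blocknr,
                                       unsigned int block_size)
{
    unsigned int per_page = TOY_PAGE_SIZE / block_size;
    struct toy_slot s;

    s.page = blocknr / per_page;
    s.slot = (unsigned int)(blocknr % per_page);
    return s;
}

/*
 * blocknrs[] lists the disk blocks backing one file page, in file order;
 * it must hold TOY_PAGE_SIZE / block_size entries.  Returns true only if
 * those blocks would land in the same slots of a single device page.
 */
static bool toy_file_page_is_device_aligned(const unsigned long *blocknrs,
                                            unsigned int block_size)
{
    unsigned int per_page, i;
    unsigned long first_page;

    assert(block_size > 0 && TOY_PAGE_SIZE % block_size == 0);
    per_page = TOY_PAGE_SIZE / block_size;
    first_page = toy_device_slot(blocknrs[0], block_size).page;

    for (i = 0; i < per_page; i++) {
        struct toy_slot s = toy_device_slot(blocknrs[i], block_size);

        if (s.page != first_page || s.slot != i)
            return false;
    }
    return true;
}
```

With 1k blocks, a file page backed by disk blocks 100..103 happens to
line up with one device page, but 101..104 (or any fragmented
allocation) does not, so the same buffers cannot serve both views
without relocation.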

--Stephen

