Re: 2.6.9-rc2 bio sickness with large writes

From: Jens Axboe
Date: Tue Sep 21 2004 - 01:43:38 EST


On Mon, Sep 20 2004, Jeff V. Merkey wrote:
> Jens Axboe wrote:
>
> >>page and offset sematics in the interface are also somewhat burdensome.
> >>Wouldn't a more reasonable
> >>interface for async IO be:
> >>
> >>address
> >>length
> >>address
> >>length
> >>
> >>rather than
> >>
> >>page structure
> >>offset in page structure
> >>page structure
> >>offset in page structure
> >>
> >>
> >
> >No, because { address, length } cannot fully describe all memory in any
> >given machine.
> >
> >
> >
> This response I don't understand. How memory is described in a machine
> for DMA addressibility
> is pretty standard (with the exception of memory on intel machine in 32
> bit systems above 4GBthat need page tables) --
> a physical numerical address. But who is going to DMA into memory not in
> the address space.

The mapped address, yes, not the data you are actually mapping for dma.
On 32-bit intel boxes, even with just 1GB of memory, you don't have a
kernel mapping of the memory above 960MB.

> >Any chunk of memory has a page associated with it, but it may not have a
> >kernel address mapping associated with it. So some identifier was needed
> >other than a virtual address, a page is as good as any so making one up
> >would be silly.
> >
> >Once you understand this, it doesn't seem so odd. You need to pass in a
> >single page or sg table to map for dma anyways, the sg table holds page
> >pointers as well.
> >
> >
> >
> >>I can assume from the interface as designed that if you pass an offset
> >>for a page that is not page aligned,
> >>and ask for a 4K write, then you will end up dropping the data on the
> >>floor than spans beyond the end of the page.
> >>
> >>
> >
> >What kind of bogus example is that? Asking for a 4K write from a 4K page
> >but asking to start 1K in that page is just stupid and not even remotely
> >valid.
> >
> >
> >
> Hardware doesn't care about page boundries. It sees hardware addresses
> and lengths, at least most SG hardware I've worked with does. For ease
> of submission, an interface that takes <address,length> would suffice.

But what if there's no 'address' that identifies the piece of memory you
want to do io for? It might not be directly addressable by the kernel.

The pci dma mapping will coalesce the segments for you in the sg table
according to the hardware restrictions. Most hardware understands sg
addressing, if it doesn't it's a bad design imho.

> Why on earth would someone need a context pointer into the kernel's
> page tables to submit an SG into a device, apart from performing
> virtual-to-physical translation?

Ehm well exactly to do that virtual-to-physical translation, perhaps?
That's the whole point.

> >It's not difficult at all. Apparently you don't understand it so you
> >think it's difficult, that's only natural. But you have access to the
> >page mapping of any given piece of data always, or if you have the
> >virtual address only it's trivial to go to the { page, offset } mapping.
> >
> >
> No, I do understand, and using page/offset at a low level SG interface

You pretty much continue to describe throughout this mail that you do
not understand why that is so.

> IS burdensome. I mean, if this is what I have to support I guess I
> have to use it, but it will be just another section of code where I
> have another **FAT** layer to waste more CPU cycles calculating
> offset/page (oh yeah I have to lookup the struct page * structure
> also) when it would be much simpler to just submit address/len in i386

What a load of crap. Where does this waste of cycles come from?

> systems. With this type of interface, If I have for instance an
> on-disk structure that starts in the middle of a 4K page due to other
> headers, etc. than spans a page, I cannot just submit the address and
> length, I have to break it into two bio requests instead of one with a
> for () loop from hell and calculate the offsets and rumage around in
> memory looking up struct page * addresses.

Calculating an offset is extremely complicated and wasteful, indeed.
It's at least a few cycles.

If you just used the bio_add_page() stuff like you are supposed to, you
would not have to do that either.

> >I can only imagine that you are used to a very different interface on
> >some other OS so you think it's difficult to use. Most of your
> >complaints seem to be based on false assumptions or because you don't
> >understand why certain design decisions were made.
> >
> >
> >
>
> No. I am used to programming to hardware with SG devices that all OS
> use. Is there somewhere a page based
> SG device (other than SCI) for disk drive?. I don't think so, I think
> they operate address/len, address/len, etc.

Where 'address' is the dma address, not the virtual address. You need
the page to do that translation.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/