Re: Linux 2.6.17-rc2

From: Jens Axboe
Date: Thu Apr 20 2006 - 15:19:22 EST


On Thu, Apr 20 2006, Linus Torvalds wrote:
>
>
> On Thu, 20 Apr 2006, Jens Axboe wrote:
> >
> > > - an ioctl/fcntl to set the maximum size of the buffer. Right now it's
> > > hardcoded to 16 "buffer entries" (which in turn are normally limited to
> > > one page each, although there's nothing that _requires_ that a buffer
> > > entry always be a page).
> >
> > This is on the TODO list, but not very high up, since I've yet to see
> > a case where the current 16-page limitation is an issue. I'm sure
> > something will come up eventually, but until then I'd rather not bother.
>
> The real reason for limiting the number of buffer entries is not to make
> the number _larger_ (although that can be a performance optimization), but
> to make it _smaller_, or at least to know/limit how big it is.
>
> It doesn't matter with the current interfaces which are mostly agnostic as
> to how big the buffer is, but it _does_ matter with vmsplice().
>
> Why?
>
> Simple: for a vmsplice() user, it's very important to know when they can
> start re-using the buffer(s) that they used vmsplice() on previously. And
> while the user could just ask the kernel how many bytes are left in the
> pipe buffer, that's pretty inefficient for many normal streaming cases.
>
> The _efficient_ way is to make the user-space buffer that you use for
> splicing information to another entity a circular buffer that is at
> least as large as any of the splice pipes involved in the transfer.
> (Depending on use, you will in many cases want to make the user-space
> buffer _twice_ as big as the kernel buffer, which makes the tracking
> even easier: while one half of the buffer is busy, you can write to the
> half that is guaranteed not to be in the kernel buffer, so you
> effectively do "double buffering".)
>
> So if you do that, you can continue to write to the buffer without ever
> worrying about re-use, because you know that by the time you wrap
> around, the kernel buffer will have been flushed out, or the vmsplice()
> will have blocked waiting for the receiver. So you no longer need to
> worry about "how much has flushed" - you only need to make sure you do
> the vmsplice() call at least twice per buffer traversal (assuming the
> "user buffer is double the size of the kernel buffer" approach).
>
> So you could do a very efficient "stdio-like" implementation for logging,
> for example, since this allows you to re-use the same pages over and over
> for splicing, without ever having any copying overhead, and without ever
> having to play VM-related games (ie you don't need to do unmaps or
> mprotects or anything expensive like that in order to get a new page or
> something).
>
> But in order to do that, you really do need to know (and preferably set)
> the size of the splice buffer. Otherwise, if the in-kernel splice buffer
> is larger than the circular buffer you use in user space, the kernel will
> add the same page _twice_ to the buffer, and you'll overwrite the data
> that you already spliced.

Good point - as you can tell, I had other uses in mind for this. I'd
prefer using fcntl for this instead of an ioctl - how about a matching
pair of F_SETPIPESZ/F_GETPIPESZ commands, or something along those
lines? For now we can just stub out setting the pipe size with -EINVAL,
but actually implement getting it.

Other suggestions?
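
From user space the intent would be something like the sketch below
(hypothetical - the command names are just the proposal above, and the
values are made-up placeholders; nothing is allocated in fcntl.h yet):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

/* proposed in this thread; the values here are invented for the sketch */
#define F_SETPIPESZ	0x7001
#define F_GETPIPESZ	0x7002

int main(void)
{
	int pfd[2];
	long sz;

	if (pipe(pfd))
		return 1;

	/* getting the size can be implemented right away... */
	sz = fcntl(pfd[0], F_GETPIPESZ);
	printf("pipe buffer size: %ld bytes\n", sz);

	/* ...while setting it just returns -EINVAL until it's wired up */
	if (fcntl(pfd[0], F_SETPIPESZ, 2 * sz) < 0)
		perror("F_SETPIPESZ");

	return 0;
}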

The vmsplice addition itself is pretty slim:

splice.c | 163 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 132 insertions(+), 31 deletions(-)

> (Now, you still need to be very careful with vmsplice() in general, since
> it leaves the data page writable in the source VM and thus allows for all
> kinds of confusion, but the theory here is "give them rope". Rope enough
> to do clever things always ends up being rope enough to hang yourself too.
> Tough.)

Oh definitely :-)

--
Jens Axboe
