Re: [RFC PATCH 1/4] splice: Fix corruption of spliced data after splice() returns

From: Linus Torvalds
Date: Wed Jul 19 2023 - 19:48:50 EST


On Wed, 19 Jul 2023 at 16:20, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> If you want "one-copy", what you can do is:
>
> - mmap() the file data (zero copy, not stable yet)
>
> - use "write()" to write the data to the network. This will copy it
> to the skbs before the write() call returns and that copy makes it
> stable.
>
> Alternatively, if you want to be more than a bit odd, you _can_ do the
> zero-copy on the write side, by doing
>
> - read the file data (one copy, now it's stable)
>
> - vmsplice() to the kernel buffer (zero copy)
>
> - splice() to the network (zero copy at least for the good cases)

Actually, I guess technically there's a third way:

- mmap the input (zero copy)

- write() to a pipe (one copy)

- splice() to the network (zero copy)

which doesn't seem to really have any sane use cases, but who knows...
It avoids the user buffer management of the vmsplice() model, and
while you cannot do anything to the data in user space *before* it is
stable (because it only becomes stable as it is copied to the pipe
buffers by the 'write()' system call), you could use "tee()" to
duplicate the now stable stream and perhaps log it or create a
checksum after-the-fact.
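
For illustration, here is a minimal sketch of the mmap + write-to-pipe + splice sequence described above. It is not code from this thread: the helper name send_one_copy() is hypothetical, error handling is deliberately blunt, and it assumes blocking descriptors and a non-empty regular file as input. The F_GETPIPE_SZ query is there only so that a single-threaded caller never writes more than the pipe can hold.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): send all of in_fd to out_fd
 * with exactly one copy. Returns 0 on success, -1 on error.
 */
static int send_one_copy(int in_fd, int out_fd)
{
	struct stat st;
	int p[2], ret = -1;

	assert(in_fd >= 0 && out_fd >= 0);
	if (fstat(in_fd, &st) < 0 || st.st_size <= 0)
		return -1;
	size_t len = (size_t)st.st_size;

	/* Step 1: mmap the input (zero copy, contents not stable yet). */
	char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, in_fd, 0);
	if (map == MAP_FAILED)
		return -1;
	if (pipe(p) < 0) {
		munmap(map, len);
		return -1;
	}

	/* Never write more than the pipe can hold, or write() would block. */
	long cap = fcntl(p[1], F_GETPIPE_SZ);
	if (cap <= 0)
		goto out;

	for (size_t off = 0; off < len; ) {
		size_t chunk = len - off < (size_t)cap ? len - off : (size_t)cap;

		/* Step 2: write() copies into the pipe buffers (the one copy). */
		ssize_t w = write(p[1], map + off, chunk);
		if (w <= 0)
			goto out;

		/* Step 3: splice() moves those pipe buffers on to the output. */
		for (ssize_t left = w; left > 0; ) {
			ssize_t s = splice(p[0], NULL, out_fd, NULL, (size_t)left, 0);
			if (s <= 0)
				goto out;
			left -= s;
		}
		off += (size_t)w;
	}
	ret = 0;
out:
	close(p[0]);
	close(p[1]);
	munmap(map, len);
	return ret;
}
```

In real use out_fd would be a connected socket; anything that splice() accepts as a destination works for experimenting.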

Another use-case would be if you want to send the *same* stable stream
to two different network connections, while still only having one
copy. You can't do that with plain splice() - because the data isn't
guaranteed to be stable, and the two network connections might see
different streams. You can't do that with the 'mmap and then
write-to-socket' approach, because the two writes not only copy twice,
they might copy different data.

And while you *can* do it with the "read+vmsplice()" approach, maybe
the "write to pipe() in order to avoid any user space buffer issues"
model is better. And "tee()" avoids the overhead of doing multiple
vmsplice() calls on the same buffer.
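
A rough sketch of that "same stable stream to two outputs" idea, under stated assumptions: the data has already been copied into a pipe (so it is stable), all descriptors are blocking, and the names fan_out() and splice_all() are hypothetical helpers invented for this example. tee() duplicates references to the pipe buffers without consuming them, so the payload is not copied a second time.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move exactly n bytes from pipe `from` to `to`. */
static int splice_all(int from, int to, size_t n)
{
	while (n > 0) {
		ssize_t s = splice(from, NULL, to, NULL, n, 0);
		if (s <= 0)
			return -1;
		n -= (size_t)s;
	}
	return 0;
}

/*
 * Hypothetical helper: src_rd is the read end of a pipe that will deliver
 * len bytes of already-stable data. Send the same bytes to both out_a and
 * out_b. Returns 0 on success, -1 on error.
 */
static int fan_out(int src_rd, int out_a, int out_b, size_t len)
{
	int copy[2], ret = -1;

	assert(src_rd >= 0 && out_a >= 0 && out_b >= 0);
	if (pipe(copy) < 0)
		return -1;

	while (len > 0) {
		/* Duplicate up to len bytes; src_rd still holds them afterwards. */
		ssize_t n = tee(src_rd, copy[1], len, 0);
		if (n <= 0)
			goto out;

		/* Drain the duplicate to A, then consume the original into B. */
		if (splice_all(copy[0], out_a, (size_t)n) < 0 ||
		    splice_all(src_rd, out_b, (size_t)n) < 0)
			goto out;
		len -= (size_t)n;
	}
	ret = 0;
out:
	close(copy[0]);
	close(copy[1]);
	return ret;
}
```

Draining the duplicate pipe completely before each new tee() keeps the loop simple; a slow out_a will stall out_b in this sketch.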

I dunno.

What I *am* trying to say is that "splice()" is actually kind of
designed for people to do these kinds of combinations. But very very
few people actually do it.

For example, the "tee()" system call exists, but it is crazy hard to
use, and I'm not sure it has ever actually been used for anything real.

Linus