Re: [RFC PATCH 0/4] splice: Fix corruption in data spliced to pipe

From: Linus Torvalds
Date: Thu Jun 29 2023 - 14:43:12 EST


On Thu, 29 Jun 2023 at 11:19, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Now, we also have SPLICE_F_GIFT. [..]
>
> Now, I would actually not disagree with removing that part. It's
> scary. But I think we don't really have any users (ok, fuse and some
> random console driver?)

Side note: maybe I should clarify. I have grown to pretty much hate
splice() over the years, just because it's been a constant source of
sorrow in so many ways.

So I'd personally be perfectly ok with just making vmsplice() be
exactly the same as write, and turn all of vmsplice() into just "it's
a read() if the pipe is open for read, and a write if it's open for
writing".

IOW, effectively get rid of vmsplice() entirely, just leaving it as a
legacy name for an interface.
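
As a rough user-space illustration of that "it's just read() or write()"
behaviour (a sketch only, not a proposed kernel patch; the name
vmsplice_compat is made up here, and the vmsplice() flags argument is
assumed to be simply ignored):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper, not a kernel or libc function: pick plain
 * vectored I/O based on how the pipe end was opened.
 */
static ssize_t vmsplice_compat(int fd, const struct iovec *iov, int nr_segs)
{
    int fl = fcntl(fd, F_GETFL);

    if (fl < 0)
        return -1;                          /* errno already set by fcntl() */

    switch (fl & O_ACCMODE) {
    case O_WRONLY:
        return writev(fd, iov, nr_segs);    /* copy user memory into the pipe */
    case O_RDONLY:
        return readv(fd, iov, nr_segs);     /* copy pipe data into user memory */
    default:
        errno = EBADF;                      /* neither a read end nor a write end */
        return -1;
    }
}
```

Because the data is copied in both directions, the caller can reuse its
buffer as soon as the call returns, which is exactly the guarantee that
page-sharing schemes cannot give.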

What I *absolutely* don't want to see is to make vmsplice() even more
complicated, and actively slower in the process. Unmapping it from the
source, removing it from the VM, is all just crazy talk.

If you want to be really crazy, I can tell you how to make for some
truly stupendously great benchmarks: make a plain "write()" system
call look up the physical page, check if it's COW'able, and if so,
mark it read-only in the source and steal the page. Now write() has
taken a snapshot of the source, and can use that page for the pipe
buffer as-is. It won't change, because if the user writes to it, the
user will just take a page fault and force a COW.

Then, to complete the thing, make 'read()' of a pipe able to just take
the page, and insert it into the destination VM (it's ok to make it
writable at that point).

You can get *wonderful* performance numbers from benchmarks with that.

I know, because I did exactly that long long ago. So long ago that I
think I had an i486 that had memory throughput measured in megabytes.
And my pipe throughput benchmark got gigabytes per second!

Of course, that benchmark relied entirely on the source of the write()
never actually writing to the page, and the reader never actually
bothering to touch the page. So it was gigabytes on a pretty bad
benchmark. But it was quite impressive.

I don't think those patches ever got posted publicly, because while
very impressive on benchmarks, it obviously was absolutely horrendous
in real life, because in real life the source of the pipe data would
(a) not usually be page-aligned anyway, and (b) even if it was and
triggered this wonderful case, it would then re-use the buffer and
take a COW fault, and now the overhead of faulting, allocating a new
page, copying said page, was obviously higher than just doing all that
in the pipe write() code without any faulting overhead.
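
To make point (a) concrete, here is a small sketch of how many pages of
a user buffer would even be candidates for stealing. The helper name and
signature are hypothetical; the only assumption is that a page can be
taken whole only if the buffer covers all of it:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count the pages that lie entirely inside
 * [buf, buf + len). Partial pages at either end cannot be stolen
 * and would still have to be copied.
 */
static size_t whole_pages_covered(const void *buf, size_t len, size_t page_size)
{
    uintptr_t start = (uintptr_t)buf;
    uintptr_t end = start + len;

    /* Round the start up and the end down to page boundaries. */
    uintptr_t first = (start + page_size - 1) / page_size * page_size;
    uintptr_t last = end / page_size * page_size;

    return last > first ? (last - first) / page_size : 0;
}
```

A page-sized buffer that starts even a few bytes past a page boundary
covers no whole page at all, so the fast path never triggers for it.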

But splice() (and vmsplice()) does conceptually come from that kind of
background.

It's just that it was never as lovely and as useful as it promised to
be. So I'd actually be more than happy to just say "let's decommission
splice entirely, just keeping the interfaces alive for backwards
compatibility".

Linus