Re: [PATCH] fs/splice: don't block splice_direct_to_actor() after data was read

From: Max Kellermann
Date: Wed Sep 20 2023 - 14:16:24 EST


On Wed, Sep 20, 2023 at 7:28 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> I think adding the flag for this case makes sense, and also exposing it
> on the UAPI side.

OK. I suggest we get this patch merged first; then I'll prepare a
follow-up patch that wires this into uapi: changing SPLICE_F_NOWAIT to
0x10 (the lowest free bit), adding it to SPLICE_F_ALL, and documenting it.

(If you prefer to have it all in this initial patch, I can amend and
resubmit it with the uapi feature.)

> My only concern is full coverage of it. We can't
> really have a SPLICE_F_NOWAIT flag that only applies to some cases.

The feature is already part of uapi - via RWF_NOWAIT, which maps to
IOCB_NOWAIT, just like my proposed SPLICE_F_NOWAIT flag. The semantics
(and the concerns) are the same, aren't they?

> That said, asking for a 2G splice, and getting a 2G splice no matter how
> slow it may be, is a bit of a "doctor it hurts when I..." scenario.

I understand this argument, but I disagree. Compare recv(socket) with
read(regular_file).
A read(regular_file) must block until the given buffer is filled
completely (or EOF is reached). That is good for programs which do not
handle partial reads, but other programs might be happy with a partial
read and prefer lower latency. There is preadv2(RWF_NOWAIT), but if it
returns EAGAIN, userspace cannot know when data will be available,
because regular files cannot be polled with epoll(). There is no way to
make a read() return as soon as at least one byte is available without
waiting for more (not even with preadv2(), unfortunately).
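
To illustrate the preadv2(RWF_NOWAIT) limitation, here is a minimal
sketch of what userspace can do today. The helper name
read_prefer_cached() and its "blocked" out-parameter are made up for
this example; the fallback on EOPNOTSUPP/ENOSYS is an assumption about
filesystems or kernels that do not support the flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* value from the Linux uapi headers */
#endif

/*
 * Hypothetical helper: try to read without blocking; if nothing is
 * immediately available (EAGAIN) or the flag is unsupported, fall back
 * to a plain blocking pread().  *blocked is set to 1 when the blocking
 * fallback was taken, 0 otherwise.
 */
static ssize_t read_prefer_cached(int fd, void *buf, size_t len,
                                  off_t off, int *blocked)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n;

    assert(blocked != NULL);
    *blocked = 0;

    /* Returns only what is available right now, or -1/EAGAIN. */
    n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0)
        return n;
    if (errno != EAGAIN && errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;

    /*
     * There is nothing to poll on for a regular file, so the only
     * remaining option here is a read that may block for the whole
     * buffer.
     */
    *blocked = 1;
    return pread(fd, buf, len, off);
}
```

The fallback path is exactly the problem: once the non-blocking attempt
fails, the caller is back to a read that waits for the full buffer (or
has to hand the read to another thread).
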
recv(socket) (or reading from a pipe) behaves differently: it blocks
only until at least one byte arrives, and callers must be able to deal
with partial reads. That's good for latency. Imagine recv() behaved
like read(): how much data would you ask the kernel to receive? If
it's too little, you need many system calls; if it's too much, your
process may block indefinitely.
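
The difference can be sketched with two small helpers (both names are
made up for this example): recv_some() is a single recv() that returns
whatever has arrived, while recv_exact() emulates the read()-like
"fill the whole buffer" behavior by looping, and therefore may block
until the peer has sent everything.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* One recv(): blocks only until at least one byte (or EOF) arrives. */
static ssize_t recv_some(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = recv(fd, buf, len, 0);
    } while (n < 0 && errno == EINTR);
    return n;
}

/*
 * Loop until exactly len bytes were received, or EOF/error.
 * Returns the number of bytes received (short only on EOF), or -1.
 */
static ssize_t recv_exact(int fd, void *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = recv_some(fd, (char *)buf + done, len - done);
        if (n < 0)
            return -1;
        if (n == 0)
            break; /* peer closed the connection */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

With recv_some() the caller can pass a large buffer without risking a
long wait; with recv_exact() the choice of len directly determines how
long the process may block.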

read(regular_file) behaves that way for historical reasons and we
can't change it, only add new APIs like preadv2(); but splice() is a
modern API whose behavior we can still shape the way we want - and
that is: copy as much as the kernel already has, but don't block after
that (in order to avoid huge latencies).

My point is: splice(2G) is a very reasonable thing to do if userspace
wants the kernel to transfer as much as possible with a single system
call. Userspace has no way to know what the best number is, so it may
as well pass the largest valid value.
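
As a rough sketch of that calling pattern (file to pipe, which is only
one of the splice() paths): the helper name and SPLICE_MAX_LEN are made
up for this example, and 0x7ffff000 assumes the usual per-call transfer
cap on Linux with 4 KiB pages. The proposed SPLICE_F_NOWAIT flag is not
part of uapi, so it is deliberately not used here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: largest length a single call will transfer (4 KiB pages). */
#define SPLICE_MAX_LEN ((size_t)0x7ffff000)

/*
 * Hypothetical helper: ask the kernel to move "as much as possible"
 * from in_fd (at its current file position) into the write end of a
 * pipe.  The caller must cope with any short count that comes back.
 */
static ssize_t splice_as_much_as_possible(int in_fd, int pipe_wr)
{
    ssize_t n;

    assert(in_fd >= 0 && pipe_wr >= 0);
    do {
        n = splice(in_fd, NULL, pipe_wr, NULL, SPLICE_MAX_LEN, 0);
    } while (n < 0 && errno == EINTR);
    return n; /* bytes moved, 0 at EOF, or -1 with errno set */
}
```

The caller does not try to guess a good length; it simply calls this in
its event loop and handles whatever count is returned.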

Max