Re: Pending splice(file -> FIFO) excludes all other FIFO operations forever (was: ... always blocks read(FIFO), regardless of O_NONBLOCK on read side?)

From: Ahelenia Ziemiańska
Date: Sat Jul 08 2023 - 21:03:32 EST


On Sat, Jul 08, 2023 at 01:06:56PM -0700, Linus Torvalds wrote:
> On Fri, 7 Jul 2023 at 17:30, Ahelenia Ziemiańska
> <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Same reproducer, backtrace attached:
> > $ scripts/faddr2line vmlinux splice_from_pipe_next+0x6e
> > splice_from_pipe_next+0x6e/0x180:
> > pipe_buf_confirm at include/linux/pipe_fs_i.h:233
> Bah. I should have looked more closely at your case.
>
> This is a buffer without an 'ops' pointer. So it looks like was
> already released.
>
> And the reason is that the pipe was readable because there were no
> writers, and I had put the
>
> if (!pipe->writers)
> return 0;
>
> check in splice_from_pipe_next() in the wrong place. It needs to be
> *before* the eat_empty_buffer() call.
>
> Anyway, while I think that fixes your NULL pointer thing,
It does.

> So while fixing your NULL pointer check should be trivial, I think
> that first patch is actually fundamentally broken wrt pipe resizing,
> and I see no really sane way to fix it. We could add a new lock just
> for that, but I don't think it's worth it.
>
> > You are, but, well, that's also the case when the pipe is full.
> > As it stands, the pipe is /empty/ and yet /no-one can write to it/.
> > This is the crux of the issue at hand.
> No, I think you are mis-representing things. The pipe isn't empty.
> It's full of things that just aren't finalized yet.
Being full of no data (as part of some hidden state)
doesn't make it any less empty, but meh; neither here not there.

> > Or, rather: splice() from a non-seekable (non-mmap()pable?)
> Please stop with the non-seekable nonsense.
>
> Any time I see a patch like this:
>
> > + if (!(in->f_mode & FMODE_LSEEK))
> > + return -EINVAL;
>
> I will just go "that person is not competent".
Accurate assessment.

> This has absolutely nothing to do with seekability.
Yes, and as noted, I was using it as a stand-in for "I/O won't
block" due to the above (and splice_direct_to_actor() already uses
it). Glad to see you've managed to synthesise my drivel into something
workable.

> But it is possible that we need to just bite the bullet and say
> "copy_splice_read() needs to use a non-blocking kiocb for the IO".
>
> Of course, that then doesn't work, because while doing this is trivial:
>
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -364,6 +364,7 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos,
> iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
> init_sync_kiocb(&kiocb, in);
> kiocb.ki_pos = *ppos;
> + kiocb.ki_flags |= IOCB_NOWAIT;
> ret = call_read_iter(in, &kiocb, &to);
>
> if (ret > 0) {
>
> I suspectr you'll find that it makes no difference, because the tty
> layer doesn't actually honor the IOCB_NOWAIT flag for various
> historical reasons.
Indeed, neither when splicing from a tty,
nor from a socket (same setup but socketpair(AF_UNIX, SOCK_STREAM, 0); w.c).

> In fact, the kiocb isn't even passed down to the
> low-level routine, which only gets the 'struct file *', and instead it
> looks at tty_io_nonblock(), which just does that legacy
>
> file->f_flags & O_NONBLOCK
>
> test.
>
> I guess combined with something like
>
> if (!(in->f_mode & FMODE_NOWAIT))
> return -EINVAL;
>
> it might all work.
Yes, that makes the splice instantly -EINVAL for ttys, but doesn't
affect the socketpair() case above, which still blocks forever.

This appears to be because vfs_splice_read() does
if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
return copy_splice_read(in, ppos, pipe, len, flags);
return in->f_op->splice_read(in, ppos, pipe, len, flags);
so the c_s_r() isn't even called for the socket, which uses
unix_stream_splice_read().

Thus,
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 123b35ddfd71..384d5a479b4a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2880,15 +2880,12 @@ static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos,
.pipe = pipe,
.size = size,
.splice_flags = flags,
+ .flags = MSG_DONTWAIT,
};

if (unlikely(*ppos))
return -ESPIPE;

- if (sock->file->f_flags & O_NONBLOCK ||
- flags & SPLICE_F_NONBLOCK)
- state.flags = MSG_DONTWAIT;
-
return unix_stream_read_generic(&state, false);
}
makes the splice -EAGAIN w/o data and splice whatever's in the socket
when there is data.

git grep '\.splice_read.*=' | cut -d= -f2 | tr -s ',;[:space:]' '\n' | sort -u
gives me 27 distinct splice_read implementations and 1 cocci config.

These are simple delegations:
nfs_file_splice_read filemap_splice_read
afs_file_splice_read filemap_splice_read
ceph_splice_read filemap_splice_read
ecryptfs_splice_read_update_atime filemap_splice_read
ext4_file_splice_read filemap_splice_read
f2fs_file_splice_read filemap_splice_read
ntfs_file_splice_read filemap_splice_read
ocfs2_file_splice_read filemap_splice_read
orangefs_file_splice_read filemap_splice_read
v9fs_file_splice_read filemap_splice_read
xfs_file_splice_read filemap_splice_read
zonefs_file_splice_read filemap_splice_read
sock_splice_read copy_splice_read or a socket-specific one
coda_file_splice_read vfs_splice_read
ovl_splice_read vfs_splice_read

vfs_splice_read calls copy_splice_read (not used as a .splice_read).

And the rest are:
01. copy_splice_read you've fixed
02. filemap_splice_read is, as I understand it, only applicable to
files/blockdevs and already has the required semantics?

03. unix_stream_splice_read can be fixed as above

04. fuse_dev_splice_read allocates a buffer and fuse_dev_do_read()s into
it, fuse_dev_do_read does
if (file->f_flags & O_NONBLOCK)
return -EAGAIN;
so this is likely a similarly easy fix
05. tracing_buffers_splice_read, when it doesn't read anything does
ret = -EAGAIN;
if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK))
goto out;
and waits for at least one thing to read;
can probably just goto out instantly
06. tracing_splice_read_pipe delegates to whatever it's tracing, and
errors if that errored, so it's fine(?)

07. shmem_file_splice_read is always nonblocking
08. relay_file_splice_read only checks SPLICE_F_NONBLOCK to turn 0 into
-EAGAIN so I think it also just doesn't block

09. smc_splice_read falls back to an embedded socket's splice_read,
or uses
if (flags & SPLICE_F_NONBLOCK)
flags = MSG_DONTWAIT;
else
flags = 0;
SMC_STAT_INC(smc, splice_cnt);
rc = smc_rx_recvmsg(smc, NULL, pipe, len, flags);
so also probably remove the condition
10. kcm_splice_read:
10a. passes flags (SPLICE_F_...) to skb_recv_datagram(), which does
timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
verbatim!? and forwards them to try_recv which also checks
for socket-style bits
10b. give it MSG_DONTWAIT, call it a day?
10c. it also passes flags to skb_splice_bits() to actually splice to the
pipe, but that discards flags, so no change needed

11. tls_sw_splice_read I don't really understand but it does
err = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK);
and
err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK,
true);
(and skb_splice_bits()) so make them both true?

12. tcp_splice_read uses skb_splice_bits() and timeout governed by
timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
and re-tries on empty read if timeo!=0 (i.e. !(sock->file->f_flags & O_NONBLOCK));
so turning that into true (timeo = 0) and propagating that would
make it behave accd'g to the new semantic

Would that make sense?
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/socket.h>

int main() {
int sp[2];
socketpair(AF_UNIX, SOCK_STREAM, 0, sp);
for(;;)
splice(sp[0], 0, 1, 0, 128 * 1024 * 1024, 0);
}

Attachment: signature.asc
Description: PGP signature