Re: splice(-> FIFO) never wakes up inotify IN_MODIFY?

From: Amir Goldstein
Date: Mon Jun 26 2023 - 14:58:07 EST


On Mon, Jun 26, 2023 at 8:14 PM Ahelenia Ziemiańska
<nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Jun 26, 2023 at 07:21:16PM +0300, Amir Goldstein wrote:
> > On Mon, Jun 26, 2023 at 6:12 PM Ahelenia Ziemiańska
> > <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 26, 2023 at 05:53:46PM +0300, Amir Goldstein wrote:
> > > > > So is it really true that the only way to poll a pipe is a
> > > > > sleep()/read(O_NONBLOCK) loop?
> > > > I don't think so, but inotify is not the way.
> > > So what is? What do the kernel developers recommend as a way to see if a
> > > file is written to, and that file happens to be a pipe?
> > >
> > > FTR, I've opened the symmetric Debian#1039488:
> > > https://bugs.debian.org/1039488
> > > against coreutils, since, if this is expected, and writing to a pipe
> > > should not generate write events on that pipe, then tail -f is currently
> > > broken on most systems.
> > First of all, it is better to mention that you are trying to fix a real
> > world use case when you are reporting a kernel misbehavior.
> I hadn't actually realised this affected coreutils tail as well before
> re-testing it today.
>
> > What this makes me wonder is, if tail -f <fifo> is broken as you claim
> > it is, how is it that decades go by without anyone noticing this problem?
> Most people don't use cat(1) that splice(2)s, I do;
> even if they do, they probably haven't filled the whole buffer so the
> missed splice(2) write was potentially covered by a later write(2) write.
>
> > When looking at tail source code I see:
> >
> > /* Mark as '.ignore'd each member of F that corresponds to a
> > pipe or fifo, and return the number of non-ignored members. */
> > static size_t
> > ignore_fifo_and_pipe (struct File_spec *f, size_t n_files)
> > {
> > /* When there is no FILE operand and stdin is a pipe or FIFO
> > POSIX requires that tail ignore the -f option.
> > Since we allow multiple FILE operands, we extend that to say: with -f,
> > ignore any "-" operand that corresponds to a pipe or FIFO. */
> >
> > and it looks like tail_forever_inotify() is not being called unless
> > there are non pipes:
> >
> > if (forever && ignore_fifo_and_pipe (F, n_files))
> > {
> >
> > The semantics of tail -f on a pipe input would be very odd, because
> > the writer would need to close before tail can figure out which are
> > the last lines.
> The semantics of tail -f for FIFOs are formalised in POSIX, which says
> (Issue 8 Draft 3):
> 115551 −f If the input file is a regular file or if the file operand specifies a FIFO, do not
> 115552 terminate after the last line of the input file has been copied, but read and copy
> 115553 further bytes from the input file when they become available. If no file operand is
> 115554 specified and standard input is a pipe or FIFO, the −f option shall be ignored. If the
> 115555 input file is not a FIFO, pipe, or regular file, it is unspecified whether or not the −f
> 115556 option shall be ignored.
> coreutils sensibly interprets these in accordance with
> https://www.mail-archive.com/austin-group-l@xxxxxxxxxxxxx/msg11402.html
>
> There are no special provisions for pipes/FIFOs before the input is
> exhausted, correct: tail is two programs in one; the first bit reads the
> input(s) to completion and outputs the bit you wanted, the second bit
> (-f) keeps reading the inputs from where the first bit left off.
>
> (Note that tail with -c +123 and -n +123 doesn't "care" what lines are
> last, and just copies from byte/line 123, but that's beside the point.
> Indeed, "tail -c+1 fifo > log" is the idealised log collection program
> from before: many programs may write to fifo, and all output is
> collected in log.)
>
> But yes: tail -f fifo first reads the entire "contents" of fifo
> (where for pipes this is defined as "until all writers hang up"),
> then continues reading fifo and copying whatever it reads.
> On a strict single-file implementation you can get away with reading and
> then sleeping when you get 0 (this is what traditional UNIX tails do).
>
> When you have multiple files, well, you want to poll them, and since
> pipes are unpollable, to avoid waking up every second and reading every
> unpollable input file to see if you got something (regular files, fifos),
> you use inotify(7) (coreutils) or kqueue(2) (NetBSD, probably others)
> to tell you when there's data.
>
> If inotify(7) for pipes worked, the entire coreutils tail -f semantic
> is implementable as a single poll(2):
> * of each individual pollable (sockets, chardevs)
> * of inotify of unpollables (pipes, regular files)
> * of pidfd (if --pid)
> this is very attractive. Naturally, I could also fall back to just a
> poll of pollables and pidfd with a second timeout if there are pipes in
> the inputs, but you see how this is sub-optimal for no real good reason.
> And, well, coreutils tail doesn't do this, so it's vulnerable.
>

Thanks for the explanation.

> > So honestly, we could maybe add IN_ACCESS/IN_MODIFY for the
> > splice_pipe_to_pipe() case, but I would really like to know what
> > the real use case is.
> And splice_file_to_pipe() which is what we're hitting here.
> The real use case is as I said: I would like to be able to poll pipes
> with inotify instead of with sleep()/read().
>
> > Another observation is that splice(2) never used to report any
> > inotify events at all until a very recent commit in v6.4
> > 983652c69199 ("splice: report related fsnotify events")
> > but this commit left out the splice_pipe_to_pipe() case.
> >
> > CC the author of the patch to ask why this case was left
> > out and whether he would be interested in fixing that.
> I'm reading the discussion following
> <20230322062519.409752-1-cccheng@xxxxxxxxxxxx> as
> "people just forget to add inotify hooks to their I/O routines as a rule",
> thus my guess on why it was left out was "didn't even cross my mind"
> (or, perhaps "didn't think we even supported fsnotify for pipes").
>
> Below you'll find a scissor-patch based on current linus HEAD;
> I've tested it works as-expected for both tty-to-pipe and pipe-to-pipe
> splices in my original reproducer.
> -- >8 --
> From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?=
> <nabijaczleweli@xxxxxxxxxxxxxxxxxx>
> Date: Mon, 26 Jun 2023 19:02:28 +0200
> Subject: [PATCH] splice: always fsnotify_access(in), fsnotify_modify(out) on
> success
>
> The current behaviour caused an asymmetry where some write APIs
> (write, sendfile) would notify the written-to/read-from objects,
> but splice wouldn't.
>
> This affected userspace which used inotify, like coreutils tail -f.
>
> Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@xxxxxxxxxxxxxxxxxx>

(Create dependency in case they should be backported)

Fixes: 983652c69199 ("splice: report related fsnotify events")

Makes sense.

Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>

> ---
> fs/splice.c | 21 ++++++++++++---------
> 1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 3e06611d19ae..94fae24f9d54 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1154,7 +1154,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
> if ((in->f_flags | out->f_flags) & O_NONBLOCK)
> flags |= SPLICE_F_NONBLOCK;
>
> - return splice_pipe_to_pipe(ipipe, opipe, len, flags);
> + ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);
> + goto notify;
> }
>
> if (ipipe) {
> @@ -1182,15 +1183,12 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
> ret = do_splice_from(ipipe, out, &offset, len, flags);
> file_end_write(out);
>
> - if (ret > 0)
> - fsnotify_modify(out);
> -
> if (!off_out)
> out->f_pos = offset;
> else
> *off_out = offset;
>
> - return ret;
> + goto notify;
> }
>
> if (opipe) {
> @@ -1209,18 +1207,23 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
>
> ret = splice_file_to_pipe(in, opipe, &offset, len, flags);
>
> - if (ret > 0)
> - fsnotify_access(in);
> -
> if (!off_in)
> in->f_pos = offset;
> else
> *off_in = offset;
>
> - return ret;
> + goto notify;
> }
>
> return -EINVAL;
> +
> +notify:
> + if (ret > 0) {
> + fsnotify_access(in);
> + fsnotify_modify(out);
> + }
> +
> + return ret;
> }
>
> static long __do_splice(struct file *in, loff_t __user *off_in,
> --
> 2.39.2
>