Re: splice(-> FIFO) never wakes up inotify IN_MODIFY?

From: Ahelenia Ziemiańska
Date: Mon Jun 26 2023 - 13:14:32 EST


On Mon, Jun 26, 2023 at 07:21:16PM +0300, Amir Goldstein wrote:
> On Mon, Jun 26, 2023 at 6:12 PM Ahelenia Ziemiańska
> <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, Jun 26, 2023 at 05:53:46PM +0300, Amir Goldstein wrote:
> > > > So is it really true that the only way to poll a pipe is a
> > > > sleep()/read(O_NONBLOCK) loop?
> > > I don't think so, but inotify is not the way.
> > So what is? What do the kernel developers recommend as a way to see if a
> > file is written to, and that file happens to be a pipe?
> >
> > FTR, I've opened the symmetric Debian#1039488:
> > https://bugs.debian.org/1039488
> > against coreutils, since, if this is expected, and writing to a pipe
> > should not generate write events on that pipe, then tail -f is currently
> > broken on most systems.
> First of all, it is better to mention that you are trying to fix a real
> world use case when you are reporting a kernel misbehavior.
I hadn't actually realised this affected coreutils tail as well before
re-testing it today.

> What this makes me wonder is, if tail -f <fifo> is broken as you claim
> it is, how is it that decades go by without anyone noticing this problem?
Most people don't use cat(1) that splice(2)s, I do;
even if they do, they probably haven't filled the whole buffer so the
missed splice(2) write was potentially covered by a later write(2) write.

> When looking at tail source code I see:
>
> /* Mark as '.ignore'd each member of F that corresponds to a
> pipe or fifo, and return the number of non-ignored members. */
> static size_t
> ignore_fifo_and_pipe (struct File_spec *f, size_t n_files)
> {
> /* When there is no FILE operand and stdin is a pipe or FIFO
> POSIX requires that tail ignore the -f option.
> Since we allow multiple FILE operands, we extend that to say: with -f,
> ignore any "-" operand that corresponds to a pipe or FIFO. */
>
> and it looks like tail_forever_inotify() is not being called unless
> there are non pipes:
>
> if (forever && ignore_fifo_and_pipe (F, n_files))
> {
>
> The semantics of tail -f on a pipe input would be very odd, because
> the writer would need to close before tail can figure out which are
> the last lines.
The semantics of tail -f for FIFOs are formalised in POSIX, which says
(Issue 8 Draft 3):
115551 −f If the input file is a regular file or if the file operand specifies a FIFO, do not
115552 terminate after the last line of the input file has been copied, but read and copy
115553 further bytes from the input file when they become available. If no file operand is
115554 specified and standard input is a pipe or FIFO, the −f option shall be ignored. If the
115555 input file is not a FIFO, pipe, or regular file, it is unspecified whether or not the −f
115556 option shall be ignored.
coreutils sensibly interprets these in accordance with
https://www.mail-archive.com/austin-group-l@xxxxxxxxxxxxx/msg11402.html

There are no special provisions for pipes/FIFOs before the input is
exhausted, correct: tail is two programs in one; the first bit reads the
input(s) to completion and outputs the bit you wanted, the second bit
(-f) keeps reading the inputs from where the first bit left off.

(Note that tail with -c +123 and -n +123 doesn't "care" what lines are
last, and just copies from byte/line 123, but that's beside the point.
Indeed, "tail -c+1 fifo > log" is the idealised log collection program
from before: many programs may write to fifo, and all output is
collected in log.)

But yes: tail -f fifo first reads the entire "contents" of fifo
(where for pipes this is defined as "until all writers hang up"),
then continues reading fifo and copying whatever it reads.
On a strict single-file implementation you can get away with reading and
then sleeping when you get 0 (this is what traditional UNIX tails do).

When you have multiple files, well, you want to poll them, and since
pipes are unpollable, to avoid waking up every second and reading every
unpollable input file to see if you got something (regular files, fifos),
you use inotify(7) (coreutils) or kqueue(2) (NetBSD, probably others)
to tell you when there's data.

If inotify(7) for pipes worked, the entire coreutils tail -f semantic
is implementable as a single poll(2):
* of each individual pollable (sockets, chardevs)
* of inotify of unpollables (pipes, regular files)
* of pidfd (if --pid)
this is very attractive. Naturally, I could also fall back to just a
poll of pollables and pidfd with a second timeout if there are pipes in
the inputs, but you see how this is sub-optimal for no real good reason.
And, well, coreutils tail doesn't do this, so it's vulnerable.

> So honestly, we could maybe add IN_ACCESS/IN_MODIFY for the
> splice_pipe_to_pipe() case, but I would really like to know what
> the real use case is.
And splice_file_to_pipe() which is what we're hitting here.
The real use case is as I said: I would like to be able to poll pipes
with inotify instead of with sleep()/read().

> Another observation is that splice(2) never used to report any
> inotify events at all until a very recent commit in v6.4
> 983652c69199 ("splice: report related fsnotify events")
> but this commit left out the splice_pipe_to_pipe() case.
>
> CC the author of the patch to ask why this case was left
> out and whether he would be interested in fixing that.
I'm reading the discussion following
<20230322062519.409752-1-cccheng@xxxxxxxxxxxx> as
"people just forget to add inotify hooks to their I/O routines as a rule",
thus my guess on why it was left out was "didn't even cross my mind"
(or, perhaps "didn't think we even supported fsnotify for pipes").

Below you'll find a scissor-patch based on current linus HEAD;
I've tested it works as-expected for both tty-to-pipe and pipe-to-pipe
splices in my original reproducer.
-- >8 --
From: =?UTF-8?q?Ahelenia=20Ziemia=C5=84ska?=
<nabijaczleweli@xxxxxxxxxxxxxxxxxx>
Date: Mon, 26 Jun 2023 19:02:28 +0200
Subject: [PATCH] splice: always fsnotify_access(in), fsnotify_modify(out) on
success

The current behaviour caused an asymmetry where some write APIs
(write, sendfile) would notify the written-to/read-from objects,
but splice wouldn't.

This affected userspace which used inotify, like coreutils tail -f.

Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@xxxxxxxxxxxxxxxxxx>
---
fs/splice.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..94fae24f9d54 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1154,7 +1154,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
if ((in->f_flags | out->f_flags) & O_NONBLOCK)
flags |= SPLICE_F_NONBLOCK;

- return splice_pipe_to_pipe(ipipe, opipe, len, flags);
+ ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);
+ goto notify;
}

if (ipipe) {
@@ -1182,15 +1183,12 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
ret = do_splice_from(ipipe, out, &offset, len, flags);
file_end_write(out);

- if (ret > 0)
- fsnotify_modify(out);
-
if (!off_out)
out->f_pos = offset;
else
*off_out = offset;

- return ret;
+ goto notify;
}

if (opipe) {
@@ -1209,18 +1207,23 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,

ret = splice_file_to_pipe(in, opipe, &offset, len, flags);

- if (ret > 0)
- fsnotify_access(in);
-
if (!off_in)
in->f_pos = offset;
else
*off_in = offset;

- return ret;
+ goto notify;
}

return -EINVAL;
+
+notify:
+ if (ret > 0) {
+ fsnotify_access(in);
+ fsnotify_modify(out);
+ }
+
+ return ret;
}

static long __do_splice(struct file *in, loff_t __user *off_in,
--
2.39.2

Attachment: signature.asc
Description: PGP signature