Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

From: Ahelenia Ziemiańska
Date: Wed Jun 28 2023 - 13:09:23 EST


On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote:
> On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
> <nabijaczleweli@xxxxxxxxxxxxxxxxxx> wrote:
> > The current behaviour caused an asymmetry where some write APIs
> > (write, sendfile) would notify the written-to/read-from objects,
> > but splice wouldn't.
> >
> > This affected userspace which uses inotify, most notably coreutils
> > tail -f, to monitor pipes.
> > If the pipe buffer had been filled by a splice-family function:
> > * tail wouldn't know and thus wouldn't service the pipe, and
> > * all writes to the pipe would block because it's full,
> > thus service was denied.
> > (For the particular case of tail -f this could be worked around
> > with ---disable-inotify.)
> Is my understanding of the tail code wrong?
> My understanding was that tail_forever_inotify() is not called for
> pipes, or is it being called when tailing a mixed collection of pipes
> and regular files? If there are subtleties like those you need to
> mention them , otherwise people will not be able to reproduce the
> problem that you are describing.
I can't squeak to the code itself, but it's trivial to check:
$ tail -f fifo &
[1] 3213996
$ echo zupa > fifo
zupa
$ echo zupa > fifo
zupa
$ echo zupa > fifo
zupa
$ cat /bin/tail > fifo
# ...
$ cat /bin/tail > fifo
hangs: the fifo is being watched with inotify.

This happens regardless of other files being specified.

tail -f doesn't follow FIFOs or pipes if they're fd 0
(guaranteed by POSIX, coreutils conforms).

OTOH, you could theoretically do
$ cat | tail -f /dev/fd/3 3<&0
which first reads from the pipe until completion (⇔ hangup, cat died),
then hangs, because it's waiting for more data on the pipe.

This can never happen under a normal scenario, but doing
$ echo zupa > /proc/3238590/fd/3
a few times reveals it's using classic 1/s polling
(and splicing to /proc/3238590/fd/3 actually yields that data being
output from tail).

> I need to warn you about something regarding this patch -
> often there are colliding interests among different kernel users -
> fsnotify use cases quite often collide with the interest of users tracking
> performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes
> specifically have been the source of several performance regression reports
> in the past and have driven optimizations like:
>
> 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead
> when there is no watcher")
> e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
>
> The moral of this story is: even if your patches are accepted by fsnotify
> reviewers, once they are staged for merging they will be subject to
> performance regression tests and I can tell you with certainty that
> performance regression will not be tolerated for the tail -f use case.
> I will push your v4 patches to a branch in my github, to let the kernel
> test bots run the performance regressions on it whenever they get to it.
>
> Moreover, if coreutils will change tail -f to start setting inotify watches
> on anonymous pipes (my understanding is that currently does not?),
> then any tail -f on anonymous pipe can cripple the "no marks on sb"
> performance optimization for all anonymous pipes and that would be
> a *very* unfortunate outcome.
As seen above, it doesn't set inotify watches on anon pipes, and
(since it manages to distinguish "| /dev/fd/3 3<&0" from "fifo",
so it must be going further than S_ISFIFO(fstat()))
this is an explicit design decision.

If you refuse setting inotifies on anon pipes then that likely won't
impact any userspace program (it's pathological, and for tail-like cases
it'd only be meaningful for magic /proc/$pid/fd/* symlinks),
and if it's in the name of performance then no-one'll likely complain,
or even notice.

> I think we need to add a rule to fanotify_events_supported() to ban
> sb/mount marks on SB_KERNMOUNT and backport this
> fix to LTS kernels (I will look into it) and then we can fine tune
> the s_fsnotify_connectors optimization in fsnotify_parent() for
> the SB_KERNMOUNT special case.
> This may be able to save your patch for the faith of NACKed
> for performance regression.
This goes over my head, but if Jan says it makes sense
then it must do.

> > Generate modify out before access in to let inotify merge the
> > modify out events in thr ipipe case.
> This comment is not clear and does not belong in this context,
> but it very much belongs near the code in question.
Turned it into
/*
* Generate modify out before access in:
* do_splice_from() may've already sent modify out,
* and this ensures the events get merged.
*/
for v5.

Attachment: signature.asc
Description: PGP signature