Re: [PATCH] tracing: Fix race when concurrently splice_read trace_pipe

From: Linus Torvalds
Date: Sun Aug 13 2023 - 12:42:00 EST


On Sat, 12 Aug 2023 at 18:13, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> So the test case cannot be run because the "sendfile" on the
> trace_pipe now fails?

So sendfile() has always required a seekable source, because it -
intentionally - doesn't do the "splice to pipe and then splice from
pipe to destination".

It just does a "splice from source, splice result to destination".

That sounds like "obviously the right thing to do", but it means that
there is now no buffer with any extended lifetime between the two
operations. And that's a big deal.

Without that buffer (ie pipe) in the middle, if the splice to
destination fails - or is partial - and the system call is interrupted
by a signal, then the source splice data is now gone, gone, gone.

So it only works if the source can then re-create the data - ie if the
source is seekable.

In that case, if the destination cannot accept all the data, the
sendfile can just go back and read again from the last successfully
written position.

And if you are a data stream that can't seek, then that "from last
successfully written position" doesn't work, and any partial writes
will just have dropped the data.

This seekability test *used* to be to test that the source was either
a regular file or a block device.

Now it literally checks "can I seek".

It looks like the trace_pipe fiel is *claiming* to be a regular file -
so sendfile() used to be ok with it - but cannot actually seek - so
sendfile would silently drop data.

Now sendfile says "I'm sorry, Dave, I'm afraid I can't do that" rather
than silently causing data loss.

Now, this is not a user-visible regression in this case, because "cat"
will always fall back on just regular read/write when sendfile fails.

So all that changed for 'trace_pipe' is that a buggy splice now no
longer triggers the bug that it used to (which the patch in question
was also trying to fix).

End result: things in many ways work better this way.

So it really looks like it never really worked before either. There
was *both* the dropped data bug because "trace_pipe" lied about being
a regular file, *and* apparently this trace_pipe race bug that Zheng
Yejian tried to fix.

Of course, some destinations always accept a full write, so maybe we
could relax the "source must be seekable" to be "source must be
seekable, _OR_ destination must be something that never returns
partial writes".

So sendfile to a POSIX-compliant regular file could always work
(ignoring filesystem full situations, and at that point I think we can
say "yeah, we're done, no recovering from that in sendfile()").

So if somebody really *really* want sendfile to work for this case, then you

(a) do need to fix the race in tracing_splice_read_pipe (which you
should do anyway, since you can obviously always use splice() itself,
not sendfile()).

AND

(b) figure out when 'splice_write()' will always succeed fully and
convince people that we can do that "extended version" of sendfile().

Hmm?

Linus