Re: Bug in short splice to socket?

From: David Howells
Date: Fri Jun 02 2023 - 04:24:41 EST


Jakub Kicinski <kuba@xxxxxxxxxx> wrote:

> On Thu, 01 Jun 2023 18:14:40 +0100 David Howells wrote:
> > The answer then might be to make TLS handle a zero-length send()
>
> IDK. Eric added MSG_SENDPAGE_NOTLAST 11 years ago, to work around
> this exact problem. Your refactoring happens to break it and what
> you're saying sounds to me more or less like "MSG_SENDPAGE_NOTLAST
> is unnecessary, it's user's fault".
>
> A bit unconvincing. Maybe Eric would chime in, I'm not too familiar
> with the deadly mess of the unchecked sendmsg()/sendpage() flags.

Not so much the "user's fault" as we couldn't fulfill what the user asked - so
should we leave it to the user to work out how to clean it up rather than
automatically allowing the socket to flush (if cancellation might be an option
instead)?

The problem I have with NOTLAST is that it won't be asserted if the short read
only occupies a single pipe_buf. We don't know that we won't get some more
data on the next pass.

An alternative way to maintain the current behaviour might be to have
splice_direct_to_actor() call the actor with sd->total_len == 0 if
do_splice_to() returned 0 and SPLICE_F_MORE wasn't set by the caller
(ie. !more). Attached is a change to do that. It becomes simpler if/once my
splice_to_socket() patches are applied - but I don't really want to push that
until all the protocols that support sendpage() support sendmsg() +
MSG_SPLICE_PAGES as well[*].

[*] Though if you're okay having a small window where TLS copies data rather
than splicing, I could push the splice_to_socket() patches *first*. TCP
and AF_UNIX splice already support MSG_SPLICE_PAGES so that would bump
their efficiency.

David
---
diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..84e9ca06db47 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -643,6 +643,22 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
wakeup_pipe_writers(pipe);
}

+/*
+ * Pass a zero-length record to the splice-write actor with SPLICE_F_MORE
+ * turned off to allow the network to see MSG_MORE deasserted.
+ */
+static ssize_t splice_from_pipe_zero(struct pipe_inode_info *pipe,
+ struct splice_desc *sd,
+ splice_actor *actor)
+{
+ struct pipe_buffer buf = {
+ .page = ZERO_PAGE(0),
+ .ops = &nosteal_pipe_buf_ops,
+ };
+
+ return actor(pipe, &buf, sd);
+}
+
/**
* __splice_from_pipe - splice data from a pipe to given actor
* @pipe: pipe to splice from
@@ -662,6 +678,9 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
int ret;

splice_from_pipe_begin(sd);
+ if (!sd->total_len)
+ return splice_from_pipe_zero(pipe, sd, actor);
+
do {
cond_resched();
ret = splice_from_pipe_next(pipe, sd);
@@ -956,13 +975,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
*/
bytes = 0;
len = sd->total_len;
+
+ /* Don't block on output, we have to drain the direct pipe. */
flags = sd->flags;
+ sd->flags &= ~SPLICE_F_NONBLOCK;

/*
- * Don't block on output, we have to drain the direct pipe.
+ * We signal MORE until we've read sufficient data to fulfill the
+ * request and we keep signalling it if the caller set it.
*/
- sd->flags &= ~SPLICE_F_NONBLOCK;
more = sd->flags & SPLICE_F_MORE;
+ sd->flags |= SPLICE_F_MORE;

WARN_ON_ONCE(!pipe_empty(pipe->head, pipe->tail));

@@ -971,21 +994,19 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
loff_t pos = sd->pos, prev_pos = pos;

ret = do_splice_to(in, &pos, pipe, len, flags);
- if (unlikely(ret <= 0))
+ if (unlikely(ret < 0))
goto out_release;

read_len = ret;
sd->total_len = read_len;

/*
- * If more data is pending, set SPLICE_F_MORE
- * If this is the last data and SPLICE_F_MORE was not set
- * initially, clears it.
+ * If we now have sufficient data to fulfill the request then
+ * we clear SPLICE_F_MORE if it was not set initially.
*/
- if (read_len < len)
- sd->flags |= SPLICE_F_MORE;
- else if (!more)
+ if ((read_len == 0 || read_len >= len) && !more)
sd->flags &= ~SPLICE_F_MORE;
+
/*
* NOTE: nonblocking mode only applies to the input. We
* must not do the output in nonblocking mode as then we
@@ -1005,6 +1026,8 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
sd->pos = prev_pos + ret;
goto out_release;
}
+ if (read_len < 0)
+ goto out_release;
}

done:
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f63e4405cf34..5d48391da16c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -995,6 +995,9 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
}
}

+ if (!msg_data_left(msg) && eor)
+ goto copied;
+
while (msg_data_left(msg)) {
if (sk->sk_err) {
ret = -sk->sk_err;