Re: [PATCH net v2] ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()

From: Willem de Bruijn
Date: Wed Sep 20 2023 - 21:42:19 EST


On Wed, Sep 20, 2023 at 9:54 AM Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>
> David Howells wrote:
> > Including the transhdrlen in length is a problem when the packet is
> > partially filled (e.g. something like send(MSG_MORE) happened previously)
> > when appending to an IPv4 or IPv6 packet as we don't want to repeat the
> > transport header or account for it twice. This can happen under some
> > circumstances, such as splicing into an L2TP socket.
> >
> > The symptom observed is a warning in __ip6_append_data():
> >
> > WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
> >
> > that occurs when MSG_SPLICE_PAGES is used to append more data to an already
> > partially occupied skbuff. The warning occurs when 'copy' is larger than
> > the amount of data in the message iterator. This is because the requested
> > length includes the transport header length when it shouldn't. This can be
> > triggered by, for example:
> >
> > sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
> > bind(sfd, ...); // ::1
> > connect(sfd, ...); // ::1 port 7
> > send(sfd, buffer, 4100, MSG_MORE);
> > sendfile(sfd, dfd, NULL, 1024);
> >
> > Fix this by deducting transhdrlen from length in ip{,6}_append_data() right
> > before we clear transhdrlen if there is already a packet that we're going
> > to try appending to.
> >
> > Reported-by: syzbot+62cbf263225ae13ff153@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@xxxxxxxxxx/
> > Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
> > cc: Eric Dumazet <edumazet@xxxxxxxxxx>
> > cc: Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx>
> > cc: "David S. Miller" <davem@xxxxxxxxxxxxx>
> > cc: David Ahern <dsahern@xxxxxxxxxx>
> > cc: Paolo Abeni <pabeni@xxxxxxxxxx>
> > cc: Jakub Kicinski <kuba@xxxxxxxxxx>
> > cc: netdev@xxxxxxxxxxxxxxx
> > cc: bpf@xxxxxxxxxxxxxxx
> > cc: syzkaller-bugs@xxxxxxxxxxxxxxxx
> > Link: https://lore.kernel.org/r/75315.1695139973@xxxxxxxxxxxxxxxxxxxxxx/ # v1
> > ---
> > net/ipv4/ip_output.c | 1 +
> > net/ipv6/ip6_output.c | 1 +
> > 2 files changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 4ab877cf6d35..9646f2d9afcf 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1354,6 +1354,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
> > if (err)
> > return err;
> > } else {
> > + length -= transhdrlen;
> > transhdrlen = 0;
> > }
> >
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 54fc4c711f2c..6a4ce7f622e9 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -1888,6 +1888,7 @@ int ip6_append_data(struct sock *sk,
> > length += exthdrlen;
> > transhdrlen += exthdrlen;
> > } else {
> > + length -= transhdrlen;
> > transhdrlen = 0;
> > }
> >
>
> Definitely a much simpler patch, thanks.
>
> So the current model is that callers with non-zero transhdrlen always
> pass to __ip_append_data payload length + transhdrlen.
>
> I do see that udp does this: ulen += sizeof(struct udphdr); This calls
> ip_make_skb if not corked, but directly ip_append_data if corked.
>
> Then __ip_append_data will use transhdrlen in its packet calculations,
> and reset that to zero after allocating the first new skb.
>
> So if corked *and* fragmentation, which would cause a new skb to be
> allocated, the next skb would incorrectly reserve udp header space,
> because the second __ip_append_data call will again pass transhdrlen.
> If so, then this patch fixes that. But that has never been reported,
> so I'm most likely misreading some part..

This works today because udp only adds transhdrlen to the length when
nothing is pending yet.
In udpv6_sendmsg:

if (up->pending) {
...
goto do_append_data;
}
ulen += sizeof(struct udphdr);

Once data is pending, ip6_append_data is therefore called with
ulen == len, and subtracting transhdrlen (which is still sizeof(udphdr))
would not be correct.

l2tp_ip6_sendmsg more or less follows udpv6_sendmsg, but it
unconditionally sets ulen = len + transhdrlen. So maybe the fix is in
L2TP:

+++ b/net/l2tp/l2tp_ip6.c
@@ -507,7 +507,6 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
*/
if (len > INT_MAX - transhdrlen)
return -EMSGSIZE;
- ulen = len + transhdrlen;

/* Mirror BSD error message compatibility */
if (msg->msg_flags & MSG_OOB)
@@ -628,6 +627,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)

back_from_confirm:
lock_sock(sk);
+ ulen = len + (skb_queue_empty(&sk->sk_write_queue) ? transhdrlen : 0);
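
One thing to keep in mind for the ulen line: in C, ?: binds more loosely
than +, so the conditional has to carry its own parentheses for the sum to
be len plus either transhdrlen or 0. A minimal userspace sketch of the
difference follows. The helper names are hypothetical, and queue_empty
stands in for skb_queue_empty(&sk->sk_write_queue):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * How "len + queue_empty ? transhdrlen : 0" is parsed: the addition
 * becomes the condition, so the result is transhdrlen or 0, never a sum.
 * The parentheses here only make that parse explicit.
 */
static size_t ulen_as_parsed(size_t len, bool queue_empty, size_t transhdrlen)
{
	return (len + queue_empty) ? transhdrlen : 0;
}

/* The intended value: the header is counted only when the queue is empty. */
static size_t ulen_intended(size_t len, bool queue_empty, size_t transhdrlen)
{
	return len + (queue_empty ? transhdrlen : 0);
}
```

With len = 1024 and transhdrlen = 4, the first form yields 4 whether or not
the queue is empty, while the second yields 1028 for an empty queue and
1024 otherwise.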

As said, only raw, udp and l2tp can possibly pass MSG_MORE and so cause
secondary invocations of ip6_append_data for the same send. With raw
passing transhdrlen 0, and udp behaving as discussed above, we only have
to consider l2tp.
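
To make the accounting above concrete, here is a rough userspace model, not
kernel code. The struct and function names are hypothetical, the header
sizes are passed in by the caller, and the append path is reduced to one
question: how many payload bytes would it expect from the message iterator?

```c
#include <assert.h>
#include <stdbool.h>

/* What a caller hands to ip6_append_data() in this simplified model. */
struct append_args {
	long length;      /* the "length" argument */
	long transhdrlen; /* the "transhdrlen" argument */
};

/* udp-style caller: the header is added to length only if nothing is pending. */
static struct append_args udp_style(long len, long hdrlen, bool pending)
{
	struct append_args a = { pending ? len : len + hdrlen, hdrlen };
	return a;
}

/* l2tp-style caller as it stands: the header is always added to length. */
static struct append_args l2tp_style(long len, long hdrlen)
{
	struct append_args a = { len + hdrlen, hdrlen };
	return a;
}

/*
 * Payload bytes the append path would expect from the iterator.
 * With an empty queue the header is part of length. With a non-empty
 * queue transhdrlen is zeroed; "deduct" models the v2 patch, which also
 * subtracts it from length first.
 */
static long expected_payload(struct append_args a, bool queue_empty, bool deduct)
{
	assert(a.transhdrlen >= 0);
	if (queue_empty)
		return a.length - a.transhdrlen;
	return deduct ? a.length - a.transhdrlen : a.length;
}
```

For a 1024-byte follow-up send, the l2tp-style caller with a 4-byte header
gives 1028 without the deduction and 1024 with it, whereas the udp-style
caller with an 8-byte header gives 1024 without it and 1016 with it. In
this model the deduction in ip6_append_data helps one caller and breaks the
other, which is why fixing the caller looks safer.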