Re: [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced

From: Jason Xing
Date: Tue Feb 15 2022 - 11:02:03 EST


On Tue, Feb 15, 2022 at 11:25 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@xxxxxxxxx> wrote:
> >
> > From: Jason Xing <xingwanli@xxxxxxxxxxxx>
> >
> > Normally, users don't care about the logic behind the kernel when
> > setting the receive buffer via setsockopt(). However, even if the new
> > value of the receive buffer is not smaller than the initial value,
> > which is sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the
> > server's wscale will shrink and bandwidth will suffer. I think that is
> > not appropriate.
>
> Then do not use SO_RCVBUF ?
>
> It is working as intended really.
>
> >
> > Here are some numbers:
> > $ sysctl -a | grep rmem
> > net.core.rmem_default = 212992
> > net.core.rmem_max = 40880000
> > net.ipv4.tcp_rmem = 4096 425984 40880000
> >
> > Case 1
> > on the server side
> > # iperf -s -p 5201
> > on the client side
> > # iperf -c [server ip] -p 5201
> > It turns out that the bandwidth is 9.34 Gbits/sec while the wscale on
> > the server side is 10. It's good.
> >
> > Case 2
> > on the server side
> > # iperf -s -p 5201 -w 425984
> > on the client side
> > # iperf -c [server ip] -p 5201
> > It turns out that the bandwidth drops to 2.73 Gbits/sec while the
> > wscale is 2, even though the receive buffer value itself is not
> > changed at all from the very beginning.
>
> Great, you discovered auto tuning is working as intended.
>

Thanks.
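
For anyone following along: iperf's "-w 425984" boils down to a plain
setsockopt(SO_RCVBUF) call before listen(), which is what sets
SOCK_RCVBUF_LOCK and stops tcp_rcv_space_adjust() from growing the
buffer. A minimal userspace illustration (a standalone demo I put
together for this reply, not part of the patch):

	#include <stdio.h>
	#include <sys/socket.h>

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		int req = 425984, got;
		socklen_t len = sizeof(got);

		/* Roughly what "iperf -s -w 425984" does before listen():
		 * the kernel doubles the value to account for skb overhead
		 * (see __sock_set_rcvbuf) and, today, also sets
		 * SOCK_RCVBUF_LOCK so autotuning leaves sk_rcvbuf alone.
		 */
		setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
		getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
		printf("requested %d, kernel reports %d\n", req, got);
		/* with the sysctls above this prints 851968, i.e. 2 * 425984 */
		return 0;
	}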

> >
> > Therefore, I added a condition so that SOCK_RCVBUF_LOCK is only set
> > when the user is trying to set a smaller rx buffer. After this patch
> > is applied, the bandwidth of case 2 recovers to 9.34 Gbits/sec.
> >
> > Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
>
> This commit has nothing to do with your patch or feature.
>
> > Signed-off-by: Jason Xing <xingwanli@xxxxxxxxxxxx>
> > ---
> > net/core/filter.c | 7 ++++---
> > net/core/sock.c   | 8 +++++---
> > 2 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 4603b7c..99f5d9c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> > case SO_RCVBUF:
> > val = min_t(u32, val, sysctl_rmem_max);
> > val = min_t(int, val, INT_MAX / 2);
> > - sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > - WRITE_ONCE(sk->sk_rcvbuf,
> > - max_t(int, val * 2, SOCK_MIN_RCVBUF));
> > + val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> > + if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> > + sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > + WRITE_ONCE(sk->sk_rcvbuf, val);
> > break;
> > case SO_SNDBUF:
> > val = min_t(u32, val, sysctl_wmem_max);
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 4ff806d..e5e9cb0 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> > * as a negative value.
> > */
> > val = min_t(int, val, INT_MAX / 2);
> > - sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > -
> > /* We double it on the way in to account for "struct sk_buff" etc.
> > * overhead. Applications assume that the SO_RCVBUF setting they make
> > * will allow that much actual data to be received on that socket.
> > @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
> > * And after considering the possible alternatives, returning the value
> > * we actually used in getsockopt is the most desirable behavior.
> > */
> > - WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> > + val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> > + if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> > + sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> > +
> > + WRITE_ONCE(sk->sk_rcvbuf, val);
> > }
> >
> > void sock_set_rcvbuf(struct sock *sk, int val)
>
> You are breaking applications that want to set sk->sk_rcvbuf to a fixed value,
> to control memory usage on millions of active sockets in a host.
>
> I think that you want new functionality, with new SO_ socket options,
> targeting the net-next tree (no spurious Fixes: tag)
>

I agree with you.

> For instance letting an application set or unset SOCK_RCVBUF_LOCK
> would be more useful, and would not break applications.

I would like to change the title in the v2 patch, since the title of
this version is a little bit misleading, to something like "net:
introduce SO_RCVBUFAUTO to unset the SOCK_RCVBUF_LOCK flag".

Eric, what do you think of SO_RCVBUFAUTO as the name for the new socket
option?
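
To make the idea concrete, what I have in mind for v2 is roughly the
following in sock_setsockopt() (an untested sketch; the option name and
semantics are only a proposal, nothing like it exists in the uapi today):

	case SO_RCVBUFAUTO:
		/* Proposed boolean option: enabling it clears
		 * SOCK_RCVBUF_LOCK so tcp_rcv_space_adjust() may resume
		 * growing the receive buffer; disabling it pins the
		 * current sk_rcvbuf, as SO_RCVBUF does today.
		 */
		if (valbool)
			sk->sk_userlocks &= ~SOCK_RCVBUF_LOCK;
		else
			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
		break;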

Thanks,
Jason