Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

From: Eric Dumazet
Date: Tue Jun 06 2023 - 03:07:51 EST


On Tue, Jun 6, 2023 at 8:43 AM Duan Muquan <duanmuquan@xxxxxxxxx> wrote:
>
> If the FIN from the passive closer and the ACK for the active closer's
> FIN are processed concurrently on different CPUs, a tw hashdance race
> may occur. On the loopback interface, the transmit function queues a
> skb to the current CPU's softnet input queue by default. Suppose the
> active closer runs on CPU 0 and the passive closer runs on CPU 1. If
> the ACK for the active closer's FIN is sent without delay, it will be
> processed and the tw hashdance will be done on CPU 0; the passive
> closer's FIN will be sent in another segment and processed on CPU 1,
> where it may fail to find the tw sock in the ehash table because of
> the tw hashdance on CPU 0, and then gets a RESET.
> If the application reconnects immediately with the same source port,
> it will get reset because the tw sock's tw_substate is still
> TCP_FIN_WAIT2.
>
> The dmesg output used to track down this issue:
>
> .333516] tcp_send_fin: sk 0000000092105ad2 cookie 9 cpu 3
> .333524] rcv_state_process:FIN_WAIT2 sk 0000000092105ad2 cookie 9 cpu 3
> .333534] tcp_close: tcp_time_wait: sk 0000000092105ad2 cookie 9 cpu 3
> .333538] hashdance: tw 00000000690fdb7a added to ehash cookie 9 cpu 3
> .333541] hashdance: sk 0000000092105ad2 removed cookie 9 cpu 3
> .333544] __inet_lookup_established: Failed the refcount check:
> !refcount_inc_not_zero 00000000690fdb7a ref 0 cookie 9 cpu 0
> .333549] hashdance: tw 00000000690fdb7a before add ref 0 cookie 9 cpu 3
> .333552] rcv_state: RST for FIN listen 000000003c50afa6 cookie 0 cpu 0
> .333574] tcp_send_fin: sk 0000000066757bf8 ref 2 cookie 0 cpu 0
> .333611] timewait_state: TCP_TW_RST tw 00000000690fdb7a cookie 9 cpu 0
> .333626] tcp_connect: sk 0000000066757bf8 cpu 0 cookie 0
>
> Here is the call trace map:
>
> CPU 0                                     CPU 1
> --------                                  --------
> tcp_close()
>   tcp_send_fin()
>     loopback_xmit()
>       netif_rx()
>         tcp_v4_rcv()
>           tcp_ack_snd_check()
>             loopback_xmit()
>               netif_rx()                  tcp_close()
>                 ...                         tcp_send_fin()
>                                               loopback_xmit()
>                                                 netif_rx()
>                                                   tcp_v4_rcv()
>                                                     ...
> tcp_time_wait()
> inet_twsk_hashdance() {
>   ...
>                                           <-__inet_lookup_established()
>                                             (bad case (1), find sk, may
>                                              fail tw_refcnt check)
>   inet_twsk_add_node_tail_rcu(tw, ...)
>                                           <-__inet_lookup_established()
>                                             (bad case (1), find sk, may
>                                              fail tw_refcnt check)
>
>   __sk_nulls_del_node_init_rcu(sk)
>                                           <-__inet_lookup_established()
>                                             (bad case (2), find tw, may
>                                              fail tw_refcnt check)
>   refcount_set(&tw->tw_refcnt, 3)
>                                           <-__inet_lookup_established()
>                                             (good case, find tw,
>                                              tw_refcnt is not 0)
>   ...
> }
>
> This issue occurs with a small probability in our application running
> over the loopback interface; the client gets a connection refused
> error when it reconnects. To reproduce it on kernels 4.14, 5.10 and
> 6.4-rc1: modify tcp_ack_snd_check() to disable delayed ACKs entirely;
> let the client connect to the server, have the server send a message
> to the client and then close the connection, and repeat this forever;
> let the client bind the same source port every time. The issue can be
> reproduced in about 20 minutes.
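>
> A hypothetical client-side reproducer sketch for the steps above (the
> fixed source port and the server port are made up; it assumes a local
> server on 127.0.0.1:5001 that writes a short message and then close()s
> first, and a kernel patched to disable delayed ACKs in
> tcp_ack_snd_check()):
>
>   /* Hypothetical reproducer sketch: addresses and ports are arbitrary. */
>   #include <arpa/inet.h>
>   #include <netinet/in.h>
>   #include <stdio.h>
>   #include <sys/socket.h>
>   #include <unistd.h>
>
>   int main(void)
>   {
>           struct sockaddr_in src = { 0 }, dst = { 0 };
>           char buf[64];
>
>           src.sin_family = AF_INET;
>           src.sin_port = htons(40000);        /* fixed source port */
>           src.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
>           dst = src;
>           dst.sin_port = htons(5001);         /* assumed server port */
>
>           for (;;) {
>                   int fd = socket(AF_INET, SOCK_STREAM, 0);
>
>                   if (fd < 0)
>                           return 1;
>                   if (bind(fd, (struct sockaddr *)&src, sizeof(src)) ||
>                       connect(fd, (struct sockaddr *)&dst, sizeof(dst))) {
>                           /* ECONNREFUSED here when the race hits */
>                           perror("bind/connect");
>                           close(fd);
>                           return 1;
>                   }
>                   while (read(fd, buf, sizeof(buf)) > 0)
>                           ;           /* wait for the server's FIN */
>                   close(fd);          /* client (passive closer) sends its FIN */
>           }
>   }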
>
> Brief description of the scenario:
>
> 1. The server runs on CPU 0 and the client runs on CPU 1. The server
> closes the connection actively and sends a FIN to the client. The
> loopback driver enqueues the FIN segment to the backlog queue of CPU 0
> via loopback_xmit()->netif_rx(), one of the conditions for a
> non-delayed ACK is met in __tcp_ack_snd_check(), and the ACK is sent
> immediately.
>
> 2. On the loopback interface, the ACK is received and processed on
> CPU 0: the 'dance' from the original sock to the tw sock is performed,
> the tw sock is inserted into the ehash table, and then the original
> sock is removed.
>
> 3. On CPU 1, the client closes the connection; a FIN segment is sent
> and processed on CPU 1. While looking up the sock in the ehash table
> (with no lock held), the tw hashdance race may occur: the lookup fails
> to find the tw sock and gets a listener sock instead in the following
> 3 cases:
>
> (1) The original sock is found, but it has already been destroyed and
> its sk_refcnt has become 0 by the time it is validated.
> (2) The tw sock is found, but its tw_refcnt has not been set to 3 yet;
> it is still 0, so the refcount validation fails.
> (3) On versions without Eric and Jason's commit (3f4ca5fafc08881d7a5
> 7daa20449d171f2887043), the tw sock is added to the head of the list.
> It is missed if the list is traversed before the tw sock is added,
> and if the original sock is removed before it is found, no
> established sock is found at all.
>
> The listener sock then resets the FIN segment, which has the ACK bit set.
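>
> For reference, the check that fails in bad cases (1) and (2) is the
> lockless refcount validation in the ehash walk; roughly (an abridged
> fragment of __inet_lookup_established(), with the 4-tuple match and
> the nulls-restart handling left out):
>
>   /* Abridged sketch of the ehash walk; not the exact upstream code. */
>   sk_nulls_for_each_rcu(sk, node, &head->chain) {
>           if (sk->sk_hash != hash)
>                   continue;
>           /* ... 4-tuple match on sk ... */
>           /* Bad case (1): sk is the old sock whose refcount has
>            * already dropped to 0.  Bad case (2): sk is the freshly
>            * inserted tw sock whose tw_refcnt is still 0.  Either way
>            * the increment fails, the established lookup returns NULL
>            * and the caller falls back to the listener lookup.
>            */
>           if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
>                   goto out;
>           return sk;
>   }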
>
> 4. If the client reconnects immediately and is assigned the same
> source port as the previous connection, the tw sock, whose tw_substate
> is TCP_FIN_WAIT2, resets the client's SYN and destroys itself in
> inet_twsk_deschedule_put(). The application gets a connection refused
> error.
>
> 5. If the client then reconnects once more, it succeeds.
>
> Introduce the following 2 modifications to solve the above 3 bad cases:
>
> For bad case (2):
> Set tw_refcnt to 3 before adding the tw sock to the list.
>
> For bad case (1):
> In tcp_v4_rcv(), if __inet_lookup_skb() returns a listener sock and
> the segment has the FIN bit set, retry the lookup once. This fix also
> covers bad case (3) for versions without Eric and Jason's fix.
>
> There may be another bad case: the original sock is found and passes
> validation, but while the passive closer's FIN is processed further on
> CPU 1, the sock is destroyed on CPU 0, so the FIN segment is dropped
> and later retransmitted. This hurts the application less than
> resetting the reconnection, and it is less likely than the other bad
> cases; it has not occurred on our product or in the experimental
> environment, so it is not addressed in this patch.
>
> Could you please check whether this fix is OK, or suggest a better
> approach? Looking forward to your comments!
>
> Signed-off-by: Duan Muquan <duanmuquan@xxxxxxxxx>
> ---
> net/ipv4/inet_timewait_sock.c | 15 +++++++--------
> net/ipv4/tcp_ipv4.c | 13 +++++++++++++
> 2 files changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 40052414c7c7..ed1f255c9aa8 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -144,14 +144,6 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
> spin_lock(lock);
>
> - inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> -
> - /* Step 3: Remove SK from hash chain */
> - if (__sk_nulls_del_node_init_rcu(sk))
> - sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> -
> - spin_unlock(lock);
> -
> /* tw_refcnt is set to 3 because we have :
> * - one reference for bhash chain.
> * - one reference for ehash chain.
> @@ -162,6 +154,13 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> * so we are not allowed to use tw anymore.
> */
> refcount_set(&tw->tw_refcnt, 3);
> + inet_twsk_add_node_tail_rcu(tw, &ehead->chain);
> +
> + /* Step 3: Remove SK from hash chain */
> + if (__sk_nulls_del_node_init_rcu(sk))
> + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> +
> + spin_unlock(lock);
> }
> EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 06d2573685ca..3e3cef202f76 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2018,6 +2018,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
> sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> skb, __tcp_hdrlen(th), th->source,
> th->dest, sdif, &refcounted);
> +
> + /* If the tw "dance" is performed on another CPU, the lookup may find
> + * no tw sock for the passive closer's FIN segment, but a listener sock,
> + * which will reset the FIN segment. If the application reconnects
> + * immediately with the same source port, it will get reset because the
> + * tw sock's tw_substate is still TCP_FIN_WAIT2. Retry the lookup once
> + * to find the tw sock.
> + */
> + if (unlikely(th->fin && sk && sk->sk_state == TCP_LISTEN)) {
> + sk = __inet_lookup_skb(net->ipv4.tcp_death_row.hashinfo,
> + skb, __tcp_hdrlen(th), th->source,
> + th->dest, sdif, &refcounted);
> + }
> +

I do not think this fixes anything; there is no barrier between the
first and second lookup.
This might reduce the race a little bit, but at the expense of extra
code in the fast path.

If you want to fix this properly, I think we need to revisit the
handling for non-established sockets, so that the bucket spinlock is
held over the whole thing.

This will be slightly more complex than this...
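
Roughly, holding the bucket spinlock over the whole thing would look
like the sketch below (untested, only to show the direction; it reuses
the existing inet_ehash_lockp() helper, and the real refactoring of the
non-established path would be more involved than this):

  /* Untested sketch: hold the ehash bucket lock across the lookup so a
   * concurrent inet_twsk_hashdance() on another CPU cannot interleave
   * with it; computing 'hash' for the 4-tuple is omitted here.
   */
  spinlock_t *lock = inet_ehash_lockp(net->ipv4.tcp_death_row.hashinfo, hash);

  spin_lock(lock);
  sk = __inet_lookup_established(net, net->ipv4.tcp_death_row.hashinfo,
                                 iph->saddr, th->source, iph->daddr,
                                 ntohs(th->dest), inet_iif(skb), sdif);
  /* ... decide how to handle the FIN before dropping the lock, so the
   * hashdance cannot run between the lookup and that decision ...
   */
  spin_unlock(lock);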