Re: [PATCH v2] tcp: fix connection reset due to tw hashdance race.

From: Duan,Muquan
Date: Thu Jun 15 2023 - 08:14:40 EST


Hi, Eric,

Could you please help check whether the following method makes sense, and
any side effect? Thanks!

If do the tw hashdance in real TCP_TIME_WAIT state with substate
TCP_TIME_WAIT, instead of in substate TCP_FIN_WAIT2, the connection
refused issue will not occur. The race of the lookup process for the
new SYN segment and the tw hashdance can come to the following
results:
1) get tw sock, SYN segment will be accepted via TCP_TW_SYN.
2) fail to get tw sock and original sock, get a listener sock,
SYN segment will be accepted by listener sock.
3) get original sock, SYN segment can be discarded in further
process after the sock is destroyed, in this case the peer will
retransmit the SYN segment. This is a very rare case, seems no need to
add lock for it.

I tested this modification in my reproducing environment,
the connection reset issue did not occur and no performance impact
observed.The side effect I currently can think out is that the original
sock will live a little longer and hold some resources longer.
The new patch is a temporary version which has a sysctl
switch for comparing the 2 methods, and some modifications
for statistics of the states not included.
I checked the implementation in FreeBSD 13.1, it does the dance in
state TCP_TIMEWAIT.



Regards!
Duan

> 2023年6月8日 下午7:54,Eric Dumazet <edumazet@xxxxxxxxxx> 写道:
>
> On Thu, Jun 8, 2023 at 1:24 PM Duan,Muquan <duanmuquan@xxxxxxxxx> wrote:
>>
>> Besides trying to find the right tw sock, another idea is that if FIN segment finds listener sock, just discard the segment, because this is obvious a bad case, and the peer will retransmit it. Or for FIN segment we only look up in the established hash table, if not found then discard it.
>>
>
> Sure, please give the RFC number and section number that discusses
> this point, and then we might consider this.
>
> Just another reminder about TW : timewait sockets are "best effort".
>
> Their allocation can fail, and /proc/sys/net/ipv4/tcp_max_tw_buckets
> can control their number to 0
>
> Applications must be able to recover gracefully if a 4-tuple is reused too fast.
>
>>
>> 2023年6月8日 下午12:13,Eric Dumazet <edumazet@xxxxxxxxxx> 写道:
>>
>> On Thu, Jun 8, 2023 at 5:59 AM Duan,Muquan <duanmuquan@xxxxxxxxx> wrote:
>>
>>
>> Hi, Eric,
>>
>> Thanks a lot for your explanation!
>>
>> Even if we add reader lock, if set the refcnt outside spin_lock()/spin_unlock(), during the interval between spin_unlock() and refcnt_set(), other cpus will see the tw sock with refcont 0, and validation for refcnt will fail.
>>
>> A suggestion, before the tw sock is added into ehash table, it has been already used by tw timer and bhash chain, we can firstly add refcnt to 2 before adding two to ehash table,. or add the refcnt one by one for timer, bhash and ehash. This can avoid the refcont validation failure on other cpus.
>>
>> This can reduce the frequency of the connection reset issue from 20 min to 180 min for our product, We may wait quite a long time before the best solution is ready, if this obvious defect is fixed, userland applications can benefit from it.
>>
>> Looking forward to your opinions!
>>
>>
>> Again, my opinion is that we need a proper fix, not work arounds.
>>
>> I will work on this a bit later.
>>
>> In the meantime you can apply locally your patch if you feel this is
>> what you want.
>>
>>