Re: [RFC PATCH net v2 1/2] net/smc: Resolve the race between link group access and termination

From: Wen Gu
Date: Thu Jan 06 2022 - 08:02:41 EST


Thanks for your reply.

On 2022/1/5 8:03 pm, Karsten Graul wrote:
On 05/01/2022 09:27, Wen Gu wrote:
On 2022/1/3 6:36 pm, Karsten Graul wrote:
On 31/12/2021 10:44, Wen Gu wrote:
On 2021/12/29 8:56 pm, Karsten Graul wrote:
On 28/12/2021 16:13, Wen Gu wrote:
We encountered some crashes caused by the race between the access
and the termination of link groups.
So I think checking conn->alert_token_local has the same effect as checking conn->lgr to
identify whether the link group pointed to by conn->lgr is still healthy and usable.

Yeah that sounds like a good solution for that! So is it now guaranteed that conn->lgr is always
set and this check can really be removed completely, or should there be a new helper that checks
conn->lgr and the alert_token, like smc_lgr_valid() ?

In my humble opinion, the link group pointed to by conn->lgr can be in one of the following
three states if we remove 'conn->lgr = NULL' from smc_lgr_unregister_conn().

1. conn->lgr = NULL and conn->alert_token_local is zero

This means that the connection has never been registered in a link group. conn->lgr clearly
cannot be used.

2. conn->lgr != NULL and conn->alert_token_local is non-zero

This means that the connection has been registered in a link group, and conn->lgr is valid to access.

3. conn->lgr != NULL but conn->alert_token_local is zero

This means that the connection was registered in a link group before, but is unregistered from
it now. conn->lgr shouldn't be used anymore.


So I am trying the following approach:

1) Introduce a new helper smc_conn_lgr_state() to distinguish the three states mentioned above.

enum smc_conn_lgr_state {
        SMC_CONN_LGR_ORPHAN,    /* conn was never registered in a link group */
        SMC_CONN_LGR_VALID,     /* conn is registered in a link group now */
        SMC_CONN_LGR_INVALID,   /* conn was registered in a link group, but now
                                 * is unregistered from it and conn->lgr should
                                 * not be used any more
                                 */
};

2) Replace the current conn->lgr check with the new helper.
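
To make the three cases concrete, here is a rough userspace sketch of how such a helper might map them onto the enum. It is only an illustration and not the code under test: the struct definitions are simplified stand-ins that keep just the two fields discussed here, and the field types and the helper body are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures (assumption: only the
 * two fields discussed in this thread are modelled).
 */
struct smc_link_group {
	int id;
};

struct smc_connection {
	struct smc_link_group *lgr;	/* no longer reset on unregister */
	uint32_t alert_token_local;	/* zero when conn is not registered */
};

enum smc_conn_lgr_state {
	SMC_CONN_LGR_ORPHAN,	/* conn was never registered in a link group */
	SMC_CONN_LGR_VALID,	/* conn is registered in a link group now */
	SMC_CONN_LGR_INVALID,	/* conn was registered before, conn->lgr
				 * should not be used any more
				 */
};

/* Hypothetical helper body: classify conn by the two fields. */
static enum smc_conn_lgr_state
smc_conn_lgr_state(const struct smc_connection *conn)
{
	if (!conn->lgr)
		return SMC_CONN_LGR_ORPHAN;	/* case 1: never registered */
	if (conn->alert_token_local)
		return SMC_CONN_LGR_VALID;	/* case 2: registered now */
	return SMC_CONN_LGR_INVALID;		/* case 3: unregistered again */
}
```

A caller that currently tests conn->lgr would instead compare the helper's result against SMC_CONN_LGR_VALID before touching the link group. The sketch ignores locking and memory ordering, which the real change would still have to deal with.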

These new changes are under testing now.

What do you think about it? :)

Thanks,
Wen Gu