Re: Question on tw_timer TIMER_PINNED

From: Valentin Schneider
Date: Tue Oct 03 2023 - 10:10:15 EST


Hi,

On 06/09/23 14:10, Eric Dumazet wrote:
> On Wed, Sep 6, 2023 at 1:58 PM Juri Lelli <juri.lelli@xxxxxxxxxx> wrote:
>>
>> Hi Eric,
>>
>> I'm bothering you with a question about timewait_sock tw_timer, as I
>> believe you are one of the last persons touching it sometime ago. Please
>> feel free to redirect if I failed to git blame it correctly.
>>
>> At my end, latency spikes (entering the kernel) have been reported when
>> running latency sensitive applications in the field (essentially a
>> polling userspace application that doesn't want any interruption at
>> all). I think I've been able to track down one of such interruptions to
>> the servicing of tw_timer_handler. This system isolates application CPUs
>> dynamically, so what I think it happens is that at some point tw_timer
>> is armed on a CPU, and it is PINNED to that CPU, meanwhile (before the
>> 60s timeout) such CPU is 'isolated' and the latency sensitive app
>> started on it. After 60s the timer fires and interrupts the app
>> generating a spike.
>>
>> I'm not very familiar with this part of the kernel and from staring
>> at code for a while I had mixed feeling about the need to keep tw_timer
>> as TIMER_PINNED. Could you please shed some light on it? Is it a strict
>> functional requirement or maybe a nice to have performance (locality I'd
>> guess) improvement? Could we in principle make it !PINNED (so that it
>> can be moved/queued away and prevent interruptions)?
>>
>
> It is a functional requirement in current implementation.
>
> cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer")
> changelog has some details about it.
>
> Can this be changed to non pinned ? Probably, but with some care.
>
> You could simply disable tw completely, it is a best effort mechanism.
>

So it's looking like doing that is not acceptable for our use-case, as
we still want timewait sockets for the traffic happening on the
housekeeping (non-isolated) CPUs.


I had a look at these commits to figure out what it would take to make it
not pinned:

cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer")
ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")

and I'm struggling to understand why we want the timer to be armed before
inet_twsk_hashdance(). I found this discussion on LKML:

https://lore.kernel.org/all/56941035.9040000@xxxxxxxxxx/

And I can see that __inet_lookup_established() and tw_timer_handler()
both operate on __tw_common.skc_nulls_node and __tw_common.skc_refcnt, but:
- the timer holds its own reference in the refcount
- sk_nulls_for_each_rcu() is (on paper) safe to run concurrently with
  tw_timer_handler
    `\
      inet_twsk_kill()
        `\
          sk_nulls_del_node_init_rcu()
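
To make the refcount point concrete, here is a toy userspace model. It is
not kernel code: struct toy_tw and the toy_* helpers are made-up names, the
initial count of 2 (one reference for the hash chain, one for the timer) is
a simplifying assumption, and it runs single-threaded, so it only shows the
reference accounting and says nothing about the actual race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a timewait socket; NOT the kernel's structure. */
struct toy_tw {
	atomic_int refcnt;	/* loosely models __tw_common.skc_refcnt */
	bool hashed;		/* loosely models membership in the hash chain */
	bool freed;		/* set once the last reference is dropped */
};

static void toy_tw_init(struct toy_tw *tw)
{
	/* Assumption: one reference for the hash chain, one for the timer. */
	atomic_init(&tw->refcnt, 2);
	tw->hashed = true;
	tw->freed = false;
}

/* Lookup side: take a reference only if the count is still non-zero. */
static bool toy_tw_get(struct toy_tw *tw)
{
	int old = atomic_load(&tw->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&tw->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_tw_put(struct toy_tw *tw)
{
	int old = atomic_fetch_sub(&tw->refcnt, 1);

	assert(old > 0);
	if (old == 1)
		tw->freed = true;
}

/* Timer side: unhash, drop the hash reference, then the timer's own. */
static void toy_timer_fire(struct toy_tw *tw)
{
	if (tw->hashed) {
		tw->hashed = false;
		toy_tw_put(tw);	/* hash chain's reference */
	}
	toy_tw_put(tw);		/* timer's own reference */
}
```

In this model, a lookup that got its reference before toy_timer_fire() runs
keeps the object alive until its own toy_tw_put(), and a lookup that comes
after the count hit zero simply fails.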

So I'm thinking we could let the timer be armed after the *hashdance(), so
it wouldn't need to be pinned anymore, but that's pretty much a revert of
ed2e92394589 ("tcp/dccp: fix timewait races in timer handling")
which fixed a race.

Now this is the first time I poke my nose into this area and I can't
properly reason about how said race is laid out. I'm sorry for asking about
such an old commit, but would you have any pointers on that?

Thanks