Re: [RFC] tcp: race in receive part

From: Jiri Olsa
Date: Tue Jun 23 2009 - 05:13:24 EST


Hi,

thanks for the answer, and sorry for my late reply;
we needed the customer's permission to disclose the debug data.


On Thu, Jun 18, 2009 at 04:06:42PM +0200, Eric Dumazet wrote:
> Jiri Olsa wrote:
> > Hi,
> >
> > in RHEL4 we can see a race in the tcp layer. We were not able to reproduce
> > this on the upstream kernel, but since the issue occurs very rarely
> > (about once every 8 days), we just might not be lucky.
> >
> > I'm afraid this might be a long email; I'll try to structure it nicely.. :)
> >
>
> Thanks for your mail and detailed analysis
>
> >
> >
> > RACE DESCRIPTION
> > ================
> >
> > There's a nice pdf describing the issue (and a solution using locks) on
> > https://bugzilla.redhat.com/attachment.cgi?id=345014
>
> I could not reach this url, unfortunately
>
> --> "You are not authorized to access bug #494404. "

please try it again, the bug should be accessible now

>
> >
> >
> > The race fires when the following code paths meet and both the tp->rcv_nxt
> > and __add_wait_queue updates stay in the CPU caches:
> >
> >   CPU1                       CPU2
> >
> >   sys_select                 receive packet
> >   ...                        ...
> >   __add_wait_queue           update tp->rcv_nxt
> >   ...                        ...
> >   tp->rcv_nxt check          sock_def_readable
> >   ...                        {
> >   schedule                     ...
> >                                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >                                        wake_up_interruptible(sk->sk_sleep)
> >                                ...
> >                              }
> >
> > Without the caching effects the code would work fine, since the wait_queue
> > and rcv_nxt accesses are ordered opposite to each other on the two CPUs.
> >
> > Meaning that once tp->rcv_nxt is updated by CPU2, CPU1 has either already
> > passed the tp->rcv_nxt check and sleeps, or it will see the new value of
> > tp->rcv_nxt and return with the new data mask.
> > In either case the process (CPU1) has been added to the wait queue, so the
> > waitqueue_active check (CPU2) cannot miss it and will wake up CPU1.
> >
> > The bad case is when the __add_wait_queue change done by CPU1 stays in its
> > cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1 then ends
> > up calling schedule and sleeping forever if no more data arrive on the
> > socket.
> >
> > Adding smp_mb() calls before the sock_def_readable call and after the
> > __add_wait_queue call should prevent the bad scenario above.
> >
> > The upstream patch is attached. It seems to prevent the issue.
> >
> >
> >
> > CPU BUGS
> > ========
> >
> > The customer has been able to reproduce this problem only on one CPU model:
> > Xeon E5345*2. They did not reproduce it on XEON MV, for example.
>
> Is there an easy way to reproduce the problem ?

there's a reproducer attached to the bug:

https://enterprise.redhat.com/issue-tracker/?module=download&fid=201560&key=f6f87caf6ac2dc1eb1173257c8a5ef78

it is basically a client/server program pair. They pass messages to each
other; after a message is sent, each side expects a message on its input
before sending another one (a rough sketch follows below).

Very rarely the code hits the case where the process that called select
is not woken up by incoming data, probably because of the memory cache
incoherency. See the backtrace in

https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1
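
For illustration, the ping-pong loop at the heart of the reproducer looks
roughly like this (my own minimal sketch, not the attached program; the
real reproducer talks TCP between two processes, the socketpair here only
keeps the sketch short):

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* each side waits for the peer's message before sending its own;
     * the bug shows up when select() keeps sleeping although the
     * peer's data has already arrived */
    static void ping_pong(int sock, int initiator)
    {
            char buf[64];
            fd_set rfds;

            if (initiator && write(sock, "ping", 4) != 4)
                    exit(1);

            for (;;) {
                    FD_ZERO(&rfds);
                    FD_SET(sock, &rfds);

                    /* the reported hang: this select() never returns */
                    if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
                            exit(1);

                    if (read(sock, buf, sizeof(buf)) <= 0)
                            break;
                    if (write(sock, "pong", 4) != 4)
                            break;
            }
    }

    int main(void)
    {
            int sv[2];

            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                    return 1;
            if (fork() == 0) {
                    close(sv[0]);
                    ping_pong(sv[1], 0);
            } else {
                    close(sv[1]);
                    ping_pong(sv[0], 1);
            }
            return 0;
    }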


>
> >
> > That CPU model happens to have 2 errata that might be related to the issue
> > (see the errata document at http://www.intel.com/Assets/PDF/specupdate/315338.pdf):
> >
> > AJ39 and AJ18. The first one can be worked around by a BIOS upgrade;
> > the other one has the following notes:
>
> AJ18 only matters on unaligned accesses, and the tcp code doesn't do this.
>
> >
> > Software should ensure at least one of the following is true when
> > modifying shared data by multiple agents:
> > - The shared data is aligned.
> > - Proper semaphores or barriers are used in order to
> >   prevent concurrent data accesses.
> >
> >
> >
> > RFC
> > ===
> >
> > I'm aware that not having this issue reproduced on the upstream kernel
> > lowers the odds of having this checked in. However, AFAICS the issue is
> > present there as well. I'd appreciate any comments/ideas.
> >
> >
> > thanks,
> > jirka
> >
> >
> > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 17b89c5..f5d9dbf 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> > struct tcp_sock *tp = tcp_sk(sk);
> >
> > poll_wait(file, sk->sk_sleep, wait);
>
> poll_wait() calls add_wait_queue() which contains a
> spin_lock_irqsave()/spin_unlock_irqrestore() pair
>
> Documentation/memory-barriers.txt states at line 1123:
>
> Memory operations issued after the LOCK will be completed after the LOCK
> operation has completed.
>
> and line 1131 states:
>
> Memory operations issued before the UNLOCK will be completed before the
> UNLOCK operation has completed.
>
> So yes, there is no full smp_mb() in poll_wait()
>
> > +
> > + /* Get in sync with the tcp_data_queue, tcp_urg
> > + and tcp_rcv_established functions. */
> > + smp_mb();
>
> If this barrier is really necessary, I guess it should be done in poll_wait() itself.
>
> Documentation/memory-barriers.txt is missing some information about poll_wait()
>
>
>
>
> > +
> > if (sk->sk_state == TCP_LISTEN)
> > return inet_csk_listen_poll(sk);
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 2bdb0da..0606e5e 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4362,8 +4362,11 @@ queue_and_out:
> >
> > if (eaten > 0)
> > __kfree_skb(skb);
> > - else if (!sock_flag(sk, SOCK_DEAD))
> > + else if (!sock_flag(sk, SOCK_DEAD)) {
> > + /* Get in sync with tcp_poll function. */
> > + smp_mb();
> > sk->sk_data_ready(sk, 0);
> > + }
> > return;
> >
>
> Oh well... if smp_mb() is needed, I believe it should be done
> right before "if (waitqueue_active(sk->sk_sleep) ... "
>
> read_lock(&sk->sk_callback_lock);
> + smp_mb();
> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> wake_up_interruptible(sk->sk_sleep)
>
> It would match other parts of the kernel (see fs/splice.c, fs/aio.c, ...)
>
> Strange thing is that read_lock() on x86 is a full memory barrier, as it uses
> "lock subl $0x1,(%eax)"
>
> Maybe we could define a smp_mb_after_read_lock() (a compiler barrier() on x86)
>

The first version of the patch was actually in this layer, see
https://bugzilla.redhat.com/attachment.cgi?id=345886

I was advised this could be too invasive (it was in waitqueue_active, actually),
so I moved the change to the TCP layer itself...
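
For reference, I guess the helper you suggest would look something like
this (just a sketch, not tested):

    /* hypothetical smp_mb_after_read_lock(): on x86, read_lock()
     * already executes a locked instruction and acts as a full
     * barrier, so a compiler barrier is enough; other architectures
     * fall back to a real smp_mb() */
    #ifdef CONFIG_X86
    #define smp_mb_after_read_lock()        barrier()
    #else
    #define smp_mb_after_read_lock()        smp_mb()
    #endif

    static void sock_def_readable(struct sock *sk, int len)
    {
            read_lock(&sk->sk_callback_lock);
            smp_mb_after_read_lock();  /* pairs with smp_mb() in tcp_poll() */
            if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                    wake_up_interruptible(sk->sk_sleep);
            ...
            read_unlock(&sk->sk_callback_lock);
    }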

As far as I understand the problem, two barriers are needed to be sure
the memory will have correct data: one in the code path calling select
(tcp_poll), and one in the path updating the available-data status
(sock_def_readable). Am I missing something?
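
To make my understanding concrete, the pairing I have in mind looks like
this (pseudo-kernel sketch, not the actual patch):

    /* CPU1 -- tcp_poll(): make our waitqueue addition visible
     * before testing for data */
    poll_wait(file, sk->sk_sleep, wait);    /* __add_wait_queue */
    smp_mb();                               /* barrier A */
    if (tp->rcv_nxt != tp->copied_seq)      /* data already there? */
            mask |= POLLIN | POLLRDNORM;    /* then we never sleep */

    /* CPU2 -- receive path: make the rcv_nxt update visible
     * before testing for sleepers */
    tp->rcv_nxt = ...;                      /* new data accounted for */
    smp_mb();                               /* barrier B */
    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
            wake_up_interruptible(sk->sk_sleep);

With A and B in place, either CPU1 sees the new tp->rcv_nxt and returns
without sleeping, or CPU2 sees the waiter on the queue and wakes it up;
the lost-wakeup window should be closed.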


thanks,
jirka