Re: BUG: IPv4: Attempt to release TCP socket in state 1

From: dormando
Date: Thu Mar 14 2013 - 17:21:17 EST


> On Wed, 2013-03-06 at 16:41 -0800, dormando wrote:
>
> > Ok... bridge module is loaded but nothing seems to be using it. No
> > bond/tunnels/anything enabled. I couldn't quickly figure out what was
> > causing it to load.
> >
> > We removed the need for macvlan, started machines with a fresh boot, and
> > they still crashed without it, after a few hours.
> >
> > Unfortunately I just saw a machine crash in the same way on 3.6.6 and
> > 3.6.9. I'm working on getting a completely pristine 3.6.6 and 3.6.9
> > tested. Our patches are minor but there were a few, so I'm backing it all
> > out just to be sure.
> >
> > Is there anything in particular which is most interesting? I can post lots
> > and lots and lots of information. Sadly bridge/macvlan weren't part of the
> > problem. .config, sysctls are easiest I guess? When this "hang" happens
> > the machine is still up somewhat, but we lose access to it. Syslog is
> > still writing entries to disk occasionally, so it's possible we could set
> > something up to dump more information.
> >
> > It takes a day or two to cycle this, so it might take a while to get
> > information and test crashes.
>
> Thanks !
>
> Please add a stack trace, it might help :
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 68f6a94..1d4d97e 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -141,8 +141,9 @@ void inet_sock_destruct(struct sock *sk)
>  	sk_mem_reclaim(sk);
>
>  	if (sk->sk_type == SOCK_STREAM && sk->sk_state != TCP_CLOSE) {
> -		pr_err("Attempt to release TCP socket in state %d %p\n",
> -		       sk->sk_state, sk);
> +		pr_err("Attempt to release TCP socket family %d in state %d %p\n",
> +		       sk->sk_family, sk->sk_state, sk);
> +		WARN_ON_ONCE(1);
>  		return;
>  	}
>  	if (!sock_flag(sk, SOCK_DEAD)) {

[58377.436522] IPv4: Attempt to release TCP socket family 2 in state 1 ffff8813fbad9500
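
For anyone decoding that line by hand: on Linux, family 2 is AF_INET and TCP state 1 is TCP_ESTABLISHED, so the socket was being destroyed while still marked established. Below is a rough userspace sketch of the lookup. The helper names tcp_state_name() and family_name() are made up for illustration; the numeric values are the Linux ones (include/net/tcp_states.h and the socket headers), and other systems may number things differently.

```c
#include <stdio.h>

/*
 * Map a Linux TCP state number (as printed by the pr_err above) to its
 * symbolic name. Values follow include/net/tcp_states.h.
 * tcp_state_name() is a hypothetical helper, not a kernel function.
 */
const char *tcp_state_name(int state)
{
    static const char *const names[] = {
        [1]  = "TCP_ESTABLISHED",
        [2]  = "TCP_SYN_SENT",
        [3]  = "TCP_SYN_RECV",
        [4]  = "TCP_FIN_WAIT1",
        [5]  = "TCP_FIN_WAIT2",
        [6]  = "TCP_TIME_WAIT",
        [7]  = "TCP_CLOSE",
        [8]  = "TCP_CLOSE_WAIT",
        [9]  = "TCP_LAST_ACK",
        [10] = "TCP_LISTEN",
        [11] = "TCP_CLOSING",
    };

    if (state < 1 || state > 11)
        return "unknown";
    return names[state];
}

/*
 * Map the address family number to a name. Only the two families
 * relevant to TCP are covered; the values are Linux-specific.
 * family_name() is a hypothetical helper.
 */
const char *family_name(int family)
{
    switch (family) {
    case 2:
        return "AF_INET";
    case 10:
        return "AF_INET6";
    default:
        return "other";
    }
}

/* Print a decoded form of "family F in state S". */
void print_decoded(int family, int state)
{
    printf("family %d (%s), state %d (%s)\n",
           family, family_name(family), state, tcp_state_name(state));
}
```

Calling print_decoded(2, 1) decodes the values from the log line above.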

Current information:

- 3.6, 3.7, 3.8 pristine all crash within a day.

- 3.2.40 pristine does not hang (so far, about 1.5 days). 3.2 does have a
TCPBacklogDrop issue so far as I can see.
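
In case it helps anyone compare kernels on the TCPBacklogDrop point: the counter is exported on the TcpExt lines of /proc/net/netstat, as a header line of names followed by a line of values in the same order. A rough sketch of pulling it out is below. The function names netstat_lookup() and read_tcpext_counter() are made up for illustration, and the 8192-byte line buffers are an assumption that happens to be comfortably larger than the TcpExt lines I have seen.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Given a header line ("TcpExt: NameA NameB ...") and the matching
 * values line ("TcpExt: 1 2 ..."), find `name` in the header and store
 * the value at the same position in *out.
 * Returns 0 on success, -1 if the name is missing or the lines do not
 * pair up. Hypothetical helper, not part of any library.
 */
int netstat_lookup(const char *header, const char *values,
                   const char *name, unsigned long long *out)
{
    char h[8192], v[8192];
    char *hsave, *vsave, *htok, *vtok;

    if (strlen(header) >= sizeof(h) || strlen(values) >= sizeof(v))
        return -1;
    strcpy(h, header);
    strcpy(v, values);

    /* The first token on both lines is the group prefix, e.g. "TcpExt:". */
    htok = strtok_r(h, " \n", &hsave);
    vtok = strtok_r(v, " \n", &vsave);
    if (!htok || !vtok || strcmp(htok, vtok) != 0)
        return -1;

    /* Walk both lines in lockstep until the requested name shows up. */
    while ((htok = strtok_r(NULL, " \n", &hsave)) &&
           (vtok = strtok_r(NULL, " \n", &vsave))) {
        if (strcmp(htok, name) == 0) {
            *out = strtoull(vtok, NULL, 10);
            return 0;
        }
    }
    return -1;
}

/*
 * Read one TcpExt counter (e.g. "TCPBacklogDrop") from
 * /proc/net/netstat. Returns 0 on success, -1 on any failure.
 */
int read_tcpext_counter(const char *name, unsigned long long *out)
{
    char header[8192], values[8192];
    FILE *f = fopen("/proc/net/netstat", "r");
    int ret = -1;

    if (!f)
        return -1;
    /* The file is a sequence of header/values line pairs. */
    while (fgets(header, sizeof(header), f) &&
           fgets(values, sizeof(values), f)) {
        if (strncmp(header, "TcpExt:", 7) == 0) {
            ret = netstat_lookup(header, values, name, out);
            break;
        }
    }
    fclose(f);
    return ret;
}
```

Sampling read_tcpext_counter("TCPBacklogDrop", &n) periodically and logging the delta would show whether the drops line up with load spikes.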

I found some discussions on this (in which you participated) that
produced patches like 252562c207a850106d9d5b41a41d29f96c0530b7 for ixgbe -
in 3.5 I think. So my next target to try will be 3.4, before these
performance patches went in.

What's really bizarre is that we've been running 3.6.6 + some patches and
only rarely getting this crash, while pristine 3.6.6 can barely last a few
hours. I've isolated the patch which "mitigates" the issue: with it
applied, the hang only happens under odd load spikes or, rarely, at
random, instead of consistently after 8-12 hours.

The patch in question is the 10g ESTATS patch. We had applied it, but had
*not* loaded the module, nor ever run any of the utils. It adds some
branches and counters to some TCP hot paths, and apparently slows things
down just enough to make the race/hang more rare.

I'm attaching it below as it was the 3.5 kernel patch hand-massaged a bit
for 3.6.6 and I wanted to supply the exact code which is accidentally
mitigating the hang.

Below the patch I'm inlining the .config for 3.6.6 and then the .config
for 3.8.2.

Please let me know what else would be useful! I feel like this + the
original traces should narrow it down a lot. I can also add more panic
traces from other crashes, but they look pretty similar to me.

thanks!
-Dormando