Re: Soft-Lockup/Race in networking in 2.6.31-rc1+195 (possibly caused by netem)

From: Jarek Poplawski
Date: Thu Jul 09 2009 - 06:44:59 EST


On Thu, Jul 09, 2009 at 12:31:53PM +0200, Thomas Gleixner wrote:
> On Thu, 9 Jul 2009, Jarek Poplawski wrote:
> > On Thu, Jul 09, 2009 at 12:23:17AM +0200, Andres Freund wrote:
> > ...
> > > Unfortunately this just yields the same backtraces during softlockup and not
> > > earlier.
> > > I did not test without lockdep yet, but that should not have stopped the BUG
> > > from appearing, right?
> >
> > Since it looks like hrtimers now, these changes in timers shouldn't
> > matter. Let's wait for new ideas.
>
> Some background:
...
> There is another oddity in cbq_undelay() which is the hrtimer callback
> function:
>
> if (delay) {
> 	ktime_t time;
>
> 	time = ktime_set(0, 0);
> 	time = ktime_add_ns(time, PSCHED_TICKS2NS(now + delay));
> 	hrtimer_start(&q->delay_timer, time, HRTIMER_MODE_ABS);
>
> The canonical way to restart a hrtimer from the callback function is to
> set the expiry value and return HRTIMER_RESTART.

OK, that's for later because we didn't use cbq here.
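
For reference, the restart pattern Thomas describes could look roughly
like the sketch below. This is a userspace mock, not kernel code and not
a proposed patch: struct mock_hrtimer, mock_set_expires() and
mock_undelay() are hypothetical stand-ins I made up for illustration
(the real counterpart of the setter would be hrtimer_set_expires()), and
the unthrottle/__netif_schedule() step is only marked by a comment.

```c
#include <stdint.h>

/* Same names as the kernel's return codes, redefined here for userspace. */
enum hrtimer_restart { HRTIMER_NORESTART, HRTIMER_RESTART };

/* Hypothetical stand-in for struct hrtimer: only the expiry matters here. */
struct mock_hrtimer {
	int64_t expires_ns;	/* absolute expiry time in nanoseconds */
};

/* Hypothetical helper: records the new expiry, does not enqueue anything. */
static void mock_set_expires(struct mock_hrtimer *t, int64_t when_ns)
{
	t->expires_ns = when_ns;
}

/*
 * Callback in the "canonical" style: if more delay is needed, only update
 * the expiry and report HRTIMER_RESTART, so that the timer core (not the
 * callback) re-enqueues the timer. No hrtimer_start() call from inside.
 */
static enum hrtimer_restart mock_undelay(struct mock_hrtimer *t,
					 int64_t now_ns, int64_t delay_ns)
{
	enum hrtimer_restart ret = HRTIMER_NORESTART;

	if (delay_ns) {
		mock_set_expires(t, now_ns + delay_ns);
		ret = HRTIMER_RESTART;
	}

	/* The unthrottle and __netif_schedule() steps would go here. */

	return ret;
}
```

The only point of the sketch is that the callback never re-arms the
timer itself; it just sets the expiry and lets the return value tell the
core what to do.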

>
> }
>
> sch->flags &= ~TCQ_F_THROTTLED;
> __netif_schedule(qdisc_root(sch));
> return HRTIMER_NORESTART;
>
> Again, this should not cause the timer to be enqueued on another CPU
> as we do not enqueue on a different CPU when the callback is running,
> but see above ...
>
> I have the feeling that the code relies on some implicit cpu
> boundness, which is no longer guaranteed with the timer migration
> changes, but that's a question for the network experts.

As a matter of fact, I've just looked at this __netif_schedule(),
which really is cpu bound, so you might be 100% right.

Thanks for your help,
Jarek P.