Re: Hang and Soft Lockup problems with generic time code

From: James Bottomley
Date: Sat Jul 08 2006 - 00:35:14 EST


On Fri, 2006-07-07 at 16:39 -0700, john stultz wrote:
> Yep. This has been seen where a large number of ticks are lost. Roman
> and I are working on a solution for this (I sent a patch out to the
> list earlier today for it, and Roman *just* posted his version a moment
> ago - if you can give one or both of them a try it would be appreciated).

Well, the patch you posted here:

Message-ID: <1152298515.5330.12.camel@localhost.localdomain>

Seems to work fine, thanks. I'm not sure which posting I should be
looking for to find the other one.


> Did you really mean jumps of 200 seconds? Hmmm. The issue Roman and I
> have been looking into does occur when we lose a number of ticks and
> that confuses the clocksource adjustment code. The fix we're working
> on corrects the adjustment confusion, but doesn't fix the lost ticks.
>
> However 200 seconds of lost ticks sounds very off. Could the driver be
> disabling interrupts for such a long period of time?

Well, what I was seeing was that

clocksource_read(clock) - clock->cycle_last

is returning a value about 200 x clock->cycle_interval,
according to the debugging printks I put into update_wall_time(). I was
assuming this was caused by a jump in the TSC count, but I suppose it
could also be caused by spurious alterations to cycle_last or other
effects I haven't traced.
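
To make the arithmetic concrete, here is a rough user-space sketch of the
check described above. It is not the kernel code: struct sketch_clock and
both helper functions are hypothetical stand-ins, and the masking step is an
assumption about how a wrapping counter would be handled. It only shows how
a delta of about 200 x cycle_interval translates into about 200 pending tick
intervals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the clocksource state. */
struct sketch_clock {
    uint64_t cycle_last;     /* counter value at the last update */
    uint64_t cycle_interval; /* counter cycles per timer tick */
    uint64_t mask;           /* valid bits of the counter (assumed) */
};

/* Cycles elapsed since cycle_last, wrapped to the counter width. */
static uint64_t sketch_elapsed_cycles(const struct sketch_clock *c,
                                      uint64_t now)
{
    return (now - c->cycle_last) & c->mask;
}

/* Number of whole tick intervals contained in the elapsed delta. */
static uint64_t sketch_pending_ticks(const struct sketch_clock *c,
                                     uint64_t now)
{
    assert(c->cycle_interval != 0);
    return sketch_elapsed_cycles(c, now) / c->cycle_interval;
}
```

Calling sketch_pending_ticks() with a counter reading that is 200 intervals
past cycle_last returns 200. The same result would come out whether the
counter itself jumped forward or cycle_last was moved backward, which is why
the delta alone cannot tell the two causes apart.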

James

