Re: time warps, I despair

Ingo Molnar
Fri, 1 Nov 1996 18:25:33 +0100 (MET)

On Thu, 31 Oct 1996, Ulrich Windl wrote:

> So what's the effect? From time to time the clock offset jumps for
> some amount always less or equal to what's worth one tick (i.e.
> offset jumps around from -5ms to 5ms when synchronized). [...]

This could be caused by a 'lost jiffy'. A lost jiffy is a timer hardware
interrupt that misses us completely. Under normal hardware circumstances
(no APM, etc.), this is only possible if there is a cli()/sti() pair that
lasts for more than 10 milliseconds (one jiffy at HZ=100).

You can totally eliminate the Pentium stuff by turning APM support on in
the kernel (or by commenting out the last few lines in time.c); the PIT
timer chip is then used as the high-precision timing offset source.

Back to the lost jiffy issue. The nice graphs you made clearly show that
the offset >jump< is always exactly 10 millisecs (1 jiffy). The graphs
show nicely how the NTP code synchronizes the Linux time to the external
time. But I have a problem with the sign of the offset. If we lose a real
jiffy, then the Linux time 'goes back', thus Linux time has a negative
offset relative to the external time. Your graphs show a positive offset
jump. This might be a misunderstanding between me and your graphs, or it
might be a sign of another problem: jiffies jumping forwards. This could
be anything: bad hardware producing spurious timer interrupts, or some bug
in the code (although I can't imagine any easy way this could happen).

Otherwise, note that the Pentium stuff is totally independent of xtime!
The Pentium stuff just provides a >small scale< precision offset on top of
xtime. Thus even if the Pentium stuff were badly broken, it couldn't
produce >global< offset jumps. gettimeofday() always uses xtime as its
base; the Pentium stuff is just a small offset added to this value. And
your graphs show a clear global xtime problem. The Linux clock really
jumps.

Now, if it's lost jiffies ... I think you should try to identify
suspicious cli()/sti() latency sources. Large cli()/sti() delays can be
caused by the following components:

- driver initialization code that has to wait for a certain amount of
time, with IRQs turned off ... if you have such a driver as a module,
then the sawtooth graph might be due to kerneld loading and unloading
this module, and causing a lost jiffy.

- the VGA console stuff still causes larger latencies under certain
circumstances. Harald Koenig told me that in earlier Linux kernels the
screensaver code copied out the whole VGA bank just to blank the screen
... causing 20 msecs of latency ... maybe there is still some nasty
code in the console stuff which went unnoticed?

The rest of the system should behave well. [I've done measurements which
showed that the typical max cli()/sti() latency is under 10 usecs,
under 100 usecs for disk and networking activity, and sometimes going over
100 usecs for console stuff ... but this was some time ago and memory
fades ... I can dig it out] [but the numbers were nowhere near enough to
cause a lost jiffy]

[ the numbers are the maximum latency measured for a particular
cli()/sti() pair, over a larger period of time ]

And we could largely reduce the range of suspicious code if you could run
the tests on a completely different machine too.

-- mingo