Re: clock freezes??

From: john stultz
Date: Tue Aug 11 2009 - 12:42:29 EST

Next message: Roland Dreier: "Re: [PATCH 12/14] scsi: use printk_once"
Previous message: Roland Dreier: "Re: [PATCH 10/14] infiniband: use printk_once"
In reply to: Victor Mataré: "Re: clock freezes??"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 2009-08-11 at 17:39 +0200, Victor Mataré wrote:
> john stultz wrote:
> > On Fri, Jul 24, 2009 at 7:07 AM, Victor Mataré <matare@xxxxxxxxxxxxxxxxxx> wrote:
> >> I have a dual Xeon server (old Xeon HT) with an Intel E7505 chipset,
> >> with hrtimer and dynticks enabled. On bootup, the kernel
> >> (2.6.29-gentoo-r5) tells me it's using the PM-Timer bug workaround, but
> >> then it uses tsc as clocksource. Now the clock was running slow by
> >> about 15sec/12hrs, which is quite a lot. So in a careless moment, I just
> >> tried "echo jiffies > clocksource0/current_clocksource". This froze the
> >> system time. Now I couldn't switch back to tsc or acpi_pm, echoing those
> >> was just ignored. Subsequently, the entire system locked up and I needed
> >> to reboot.
> >>
> >> Now what does that mean? Is this supposed to happen? Should I disable
> >> dynticks and/or hrtimer?
> >
> > The system lockup is a known issue and should be resolved with the
> > following commit:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3f68535adad8dd89499505a65fb25d0e02d118cc
> >
> > I'd be curious if you could expand a bit more on the clock skew
> > (15sec per 12 hours) you're seeing. Are you running NTP? Do you have
> > the output of ntpdc -c kerninfo and ntpdc -c peers? Do you see lots of
> > ntp messages in /var/log/messages or /var/log/syslog?
> >
> > thanks
> > -john
>
> Until now, I was just using BSD netdate, which kept adding 12-25 seconds
> every 24 hours.
> The whole issue is related to strange lockups I had been seeing about
> monthly, apparently every time the clock was rewound instead of put
> forward (clock freezes, programs hang, system ends up deadlocked within
> 10-300 minutes depending on usage). The system is a production
> fileserver acting mainly as a Samba PDC, so testing this scenario is
> quite difficult. Now recently, I swapped the motherboard including RAM
> and CPU with our webserver, which seems to have removed the monthly
> time-freeze, but led to the above-mentioned freeze caused by me
> experimenting with clocksource=jiffies because of the slow clock.
> However, the monthly freezes upon rewinding the clock may be gone now
> just because the clock is consistently running slow, so it doesn't need
> to be rewound any more.
> I've just switched both systems to ntpd:
>
> # ntpdc -c kerninfo
> pll offset: 0 s
> pll frequency: 0.000 ppm
> maximum error: 16 s
> estimated error: 16 s
> status: 0040 unsync
> pll time constant: 4
> precision: 1e-06 s
> frequency tolerance: 500 ppm

So here it's reporting that you're still in unsync mode, so ntp hasn't
run long enough to make any corrections. You might let the system run
with ntpd for a few hours (or a day) and then retry the "ntpdc -c
kerninfo" command.
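
For reference, the "status: 0040" field in the kerninfo output is a bit
mask. Below is a minimal sketch of decoding it, assuming the STA_* bit
values from the Linux <linux/timex.h> header. The helper name
decode_status is made up for this illustration and is not part of ntpdc
or the kernel.

```python
# STA_* status bits, assumed to match the Linux <linux/timex.h> header.
STA_FLAGS = {
    0x0001: "STA_PLL",
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",
    0x0020: "STA_DEL",
    0x0040: "STA_UNSYNC",
    0x0080: "STA_FREQHOLD",
    0x0100: "STA_PPSSIGNAL",
    0x0200: "STA_PPSJITTER",
    0x0400: "STA_PPSWANDER",
    0x0800: "STA_PPSERROR",
    0x1000: "STA_CLOCKERR",
    0x2000: "STA_NANO",
    0x4000: "STA_MODE",
    0x8000: "STA_CLK",
}


def decode_status(status):
    """Return the names of the STA_* bits set in a kernel time status word."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]


# ntpdc prints the status word in hex, so "0040" is 0x0040.
print(decode_status(0x0040))  # ['STA_UNSYNC']
```

Once ntpd has disciplined the clock, the unsync bit would be expected to
clear and the PLL bit to appear instead.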

> # ntpdc -c peers -n
>      remote           local      st poll reach  delay   offset    disp
> =======================================================================
> =134.130.4.17    137.226.164.2    1   64  377 0.00061  0.156253 0.03084
> *134.130.5.17    137.226.164.2    1   64  377 0.00035  0.205312 0.03041
>
> Dunno how to interpret that. Syslog now gives:
>
> Aug 11 17:13:02 bussard ntpd[21845]: ntpd 4.2.4p7@xxxxxxxx Tue Jun 23 10:58:51 UTC 2009 (1)
> Aug 11 17:13:02 bussard ntpd[21874]: precision = 1.000 usec
> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #1 lo, 127.0.0.1#123 Enabled
> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #2 eth0, 137.226.164.2#123 Enabled
> Aug 11 17:13:02 bussard ntpd[21874]: Listening on interface #3 eth0:1, 192.168.23.3#123 Enabled
> Aug 11 17:13:02 bussard ntpd[21874]: kernel time sync status 0040
> ...
> Aug 11 17:16:18 bussard ntpd[21874]: synchronized to 134.130.4.17, stratum 1
> Aug 11 17:16:32 bussard ntpd[21874]: time reset +13.979355 s
> ...
> Aug 11 17:20:43 bussard ntpd[21874]: synchronized to 134.130.5.17, stratum 1

I'd also be interested in seeing the syslog data after a day or so of
running ntpd.
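
As a rough cross-check, the drift figures quoted earlier in the thread
(15 sec per 12 hours, and 12-25 sec per 24 hours with netdate) can be
converted to parts per million and compared with the 500 ppm frequency
tolerance shown by kerninfo. The sketch below is plain arithmetic; the
function name drift_ppm is only illustrative.

```python
def drift_ppm(seconds_off, interval_seconds):
    """Convert a clock error accumulated over an interval into ppm."""
    return seconds_off / interval_seconds * 1e6


HOUR = 3600

print(round(drift_ppm(15, 12 * HOUR), 1))  # 347.2 (15 sec per 12 hours)
print(round(drift_ppm(12, 24 * HOUR), 1))  # 138.9 (12 sec per 24 hours)
print(round(drift_ppm(25, 24 * HOUR), 1))  # 289.4 (25 sec per 24 hours)
```

All three values are below 500 ppm. If the reported figures are
accurate, the "pll frequency" line of kerninfo would be expected to
settle somewhere in that range after a day, but that is something to
confirm from the actual output rather than assume.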

> However, the issue of the clock freezing upon time-rewind still remains
> quite unclear to me. Can it be caused by the careless way in which
> netdate does it? Can it be related to the jiffies-hrtimer issue?

I'm not familiar with netdate. However, it should be possible to set the
time in either the forward or backward direction without the system
hanging.

Can you trigger the problem yourself using date -s ?
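
When reproducing this, it may help to log when the wall clock actually
gets stepped so a later hang can be correlated with it. Below is a
minimal sketch, assuming Python 3 is available on the box: it compares
wall-clock time against the monotonic clock and reports any jump between
samples. The names find_steps and watch are made up for this
illustration. Note the assumption that the monotonic clock keeps
running: if the clocksource itself stalls, both clocks (and the sleep
between samples) may stall too, in which case the last line logged only
tells you roughly when things stopped.

```python
import time


def find_steps(samples, threshold=0.5):
    """Return (index, jump_seconds) for each wall-clock step in samples.

    samples is an ordered list of (wall, monotonic) readings. A step is
    a change in the wall-minus-monotonic offset larger than threshold
    seconds; a negative jump means the clock was set backward.
    """
    steps = []
    for i in range(1, len(samples)):
        prev_offset = samples[i - 1][0] - samples[i - 1][1]
        offset = samples[i][0] - samples[i][1]
        jump = offset - prev_offset
        if abs(jump) > threshold:
            steps.append((i, jump))
    return steps


def watch(count=5, interval=1.0):
    """Sample both clocks count times and return any steps observed."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), time.monotonic()))
        time.sleep(interval)
    return find_steps(samples)


# Synthetic readings: the wall clock is set back 60 seconds before the
# third sample while the monotonic clock keeps advancing.
print(find_steps([(100.0, 10.0), (101.0, 11.0), (42.0, 12.0)]))  # [(2, -60.0)]
```

Running watch() in one terminal while issuing date -s in another (as
root, ideally on a test machine rather than the production fileserver)
should report the step if one occurred.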

thanks
-john
