Re: [PATCH] hangcheck-timer is broken on x86

From: Yury Polyanskiy
Date: Mon Mar 29 2010 - 17:08:38 EST


>> > What I'm saying is that if you're using getrawmonotonic() to detect
>> > hangs, you might miss them, as getrawmonotonic may wrap (and thus stop
>> > continually increasing) if the timer interrupt is delayed. This does not
>> > apply to systems using the TSC clocksource, but does apply to systems
>> > using the acpi_pm.
>>
>> But if timer interrupt is delayed by more than acpi_pm wrap-around
>> time, then the update_wall_time() is also screwed. Since it is not, we
>> can rely on getrawmonotonic().
>
> Right, if the box hangs for longer than the clocksource can count for,
> the timekeeping subsystem will be off by some multiple of that length.
>

Oh, I see. You mean that getrawmonotonic() wouldn't work under
abnormal conditions. I understand now, sorry for the confusion. You
are correct, of course.
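
To make the wrap-around concrete, here is a small userspace sketch (not
kernel code). It assumes the acpi_pm counter is 24 bits wide and runs at
3.579545 MHz, which gives a wrap period of roughly 4.7 seconds. The names
pm_delta() and pm_apparent_ms() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed acpi_pm parameters: 24-bit counter at 3.579545 MHz. */
#define PM_MASK 0xFFFFFFu
#define PM_HZ   3579545u

/* Ticks elapsed as seen by a reader holding only two counter samples. */
static uint32_t pm_delta(uint32_t prev, uint32_t now)
{
    return (now - prev) & PM_MASK;
}

/*
 * Apparent elapsed milliseconds after a real stall of stall_ms, when
 * nobody sampled the counter in between. Whole wrap periods are lost.
 */
static uint64_t pm_apparent_ms(uint32_t start, uint64_t stall_ms)
{
    uint64_t real_ticks;
    uint32_t now;

    assert(start <= PM_MASK);
    real_ticks = stall_ms * PM_HZ / 1000;
    now = (uint32_t)((start + real_ticks) & PM_MASK);
    return (uint64_t)pm_delta(start, now) * 1000 / PM_HZ;
}
```

Under these assumptions a 1 second stall is measured correctly, while a
10 second stall shows up as well under 1 second, because two full wrap
periods vanish from the delta.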

I personally don't like the idea of relying on read_persistent_clock(),
and not only because of hwclock and ntp. My main interest in
hangcheck-timer is to set a very low margin (1 to 3 jiffies, for
example) so that I get a log message on any kernel slowdown or missed
tick (as a hardware integrity check). I don't think
read_persistent_clock() is precise enough for this purpose, is it?
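
For scale, a rough sketch of the arithmetic behind that question. It
ASSUMES the persistent clock has one-second granularity (as a CMOS RTC
would); that is my assumption and needs confirming. margin_ns() and
clock_can_resolve() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert a margin given in jiffies to nanoseconds for a given HZ. */
static uint64_t margin_ns(unsigned int margin_jiffies, unsigned int hz)
{
    assert(hz > 0);
    return (uint64_t)margin_jiffies * NSEC_PER_SEC / hz;
}

/* A clock can only detect an overrun no finer than its own resolution. */
static int clock_can_resolve(uint64_t margin, uint64_t resolution_ns)
{
    return resolution_ns <= margin;
}
```

With HZ=1000, a 3 jiffy margin is 3 ms; with HZ=100 it is 30 ms. Either
way it is far below an assumed 1 s resolution, so
clock_can_resolve(margin_ns(3, 100), NSEC_PER_SEC) comes out false.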

Also, hooking into the NTP update code complicates an otherwise simple
driver. Instead, I propose that on non-S390 the driver simply checks at
load time whether the current clocksource is something other than TSC,
and if so prints a warning (something like "Hangcheck: kernel using
clocksource %s, which is not reliable for hang detection").
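
A minimal userspace sketch of the check I have in mind, only to show the
logic. hangcheck_clocksource_warning() is a hypothetical helper; in the
driver this would be a printk from the init function, and how the driver
obtains the current clocksource name is left open here. The non-S390
guard is omitted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 and fills buf with the warning text
 * when the clocksource is anything other than "tsc"; returns 0 and
 * leaves buf empty otherwise.
 */
static int hangcheck_clocksource_warning(const char *name,
                                         char *buf, size_t len)
{
    assert(name != NULL && buf != NULL && len > 0);

    if (strcmp(name, "tsc") == 0) {
        buf[0] = '\0';
        return 0;
    }

    snprintf(buf, len,
             "Hangcheck: kernel using clocksource %s, "
             "which is not reliable for hang detection", name);
    return 1;
}
```

Called with "acpi_pm" this produces the warning; called with "tsc" it
stays silent.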

What do you think about it?

Thanks,
Yury