Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

From: Jeff Merkey
Date: Wed Jan 20 2016 - 11:40:16 EST


On 1/20/16, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> Jeff,
>
> On Wed, 20 Jan 2016, Thomas Gleixner wrote:
>> On Tue, 19 Jan 2016, Jeff Merkey wrote:
>> > Nasty bug but trivial fix for this. What happens here is RAX (nsecs)
>> > gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
>>
>> And how exactly does that happen?
>>
>> 0x17AE7F57C671EA7D = 1.70644e+18 nsec
>> = 1.70644e+09 sec
>> = 2.84407e+07 min
>> = 474011 hrs
>> = 19750.5 days
>> = 54.1109 years
>>
>> That's the real issue, not what you are trying to 'fix' in
>> timespec_add_ns()
>
> And that's caused by stopping the whole machine for 20 minutes. It violates
> the assumption of the timekeeping core, that the maximum time which is
> between two updates of the core is < 5-10min. So that insane large number is
> caused by a mult overrun when converting the time delta to nanoseconds.
>
> You can find that limit via:
>
> # dmesg | grep tsc | grep max_idle_ns
> [ 5.242683] clocksource tsc: mask: 0xffffffffffffffff max_cycles: 0x21139a22526, max_idle_ns: 440795252169 ns
>
> So on that machine the limit is:
>
> 440795252169 nsec
> 440.795 sec
> 7.34659 min
>
> And before you ask or come up with patches: No, we are not going to add
> anything to the core timekeeping code to work around this limitation simply
> because it's going to add overhead to a performance sensitive code path for
> a very limited value.

Given how fragile that code appears to be, this is reasonable.
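
For anyone following along, here is a minimal userspace sketch of the overrun
described above. It is not the kernel code. The function names and the
mult/shift pair (2^22 and 24, which models a 4 GHz counter) are assumptions
chosen to keep the arithmetic easy to follow, not values from this machine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model only: convert a cycle delta to nanoseconds as
 * (delta * mult) >> shift. The 64-bit product wraps modulo 2^64 once
 * delta exceeds UINT64_MAX / mult, and the result is then garbage.
 */
static uint64_t cyc_to_ns_unchecked(uint64_t delta, uint32_t mult,
				    uint32_t shift)
{
	assert(shift < 64);
	return (delta * mult) >> shift;
}

/*
 * Same conversion, but refuses deltas that would overflow the product.
 * Returns 0 and stores the result in *ns, or -1 if delta is too large.
 */
static int cyc_to_ns_checked(uint64_t delta, uint32_t mult, uint32_t shift,
			     uint64_t *ns)
{
	assert(shift < 64);
	if (mult != 0 && delta > UINT64_MAX / mult)
		return -1;
	*ns = (delta * mult) >> shift;
	return 0;
}
```

With these assumed factors the product overflows after 2^42 cycles, roughly
18 minutes at 4 GHz. A 20 minute stop (4800000000000 cycles) passed through
cyc_to_ns_unchecked() comes back as about 100 seconds instead of 1200, while
cyc_to_ns_checked() rejects it. Whether the wrapped value ends up too small or
absurdly large depends on the operands; either way it is wrong.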

>
> Keeping a machine stopped for 20 minutes will make a lot of other things
> unhappy, so introducing a 'fix' for that particular issue is just silly.
>

What's needed here is some form of touch function to keep this system
updated while spinning in the debugger. That would solve it, and I can
maintain a fix for that locally. I debugged the soft hang in systemd last
night and discovered that it's all related to this function returning bogus
time: systemd was making a system call that eventually reached
ktime_get_ts64 and got garbage back. When the value wraps, it causes all
sorts of bad stuff.

Do you have any suggestions on how a touch function could be coded to keep
this subsystem updated while the debugger is active? There are already a few
of them that I have to call, just as kgdb and kdb do, to get around some of
this.

void mdb_watchdogs(void)
{
	touch_softlockup_watchdog_sync();
	clocksource_touch_watchdog();

#if defined(CONFIG_TREE_RCU)
	rcu_cpu_stall_reset();
#endif

	touch_nmi_watchdog();
#ifdef CONFIG_HARDLOCKUP_DETECTOR
	touch_hardlockup_watchdog();
#endif
}

As you can see, there are already quite a few subsystems that manage this
problem of debuggers holding the system in stasis.
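
To make the question concrete, here is a rough userspace model of the kind of
touch function I have in mind. Everything in it is hypothetical: struct
tk_model, tk_model_touch() and the mult/shift values are made up for
illustration, and none of it is an existing kernel interface. The idea is
only that folding the elapsed cycles into the base time often enough keeps
each delta below the overflow limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of timekeeper state; not a kernel structure. */
struct tk_model {
	uint64_t cycle_last;	/* counter value at the last update */
	uint64_t base_ns;	/* nanoseconds accumulated so far */
	uint32_t mult;		/* assumed cycles-to-ns multiplier */
	uint32_t shift;		/* assumed cycles-to-ns shift */
};

/*
 * Fold the cycles elapsed since the last update into base_ns.
 * Returns 0 on success, or -1 (leaving the state untouched) if the
 * delta is already large enough that delta * mult would overflow.
 */
static int tk_model_touch(struct tk_model *tk, uint64_t now)
{
	uint64_t delta = now - tk->cycle_last;

	assert(tk->shift < 64);
	if (tk->mult != 0 && delta > UINT64_MAX / tk->mult)
		return -1;

	tk->base_ns += (delta * tk->mult) >> tk->shift;
	tk->cycle_last = now;
	return 0;
}
```

With mult = 2^22 and shift = 24 (a modeled 4 GHz counter), calling
tk_model_touch() once per second across a 20 minute stop accumulates exactly
1200 seconds, whereas a single call at the end of the stop fails because the
delta is past the limit. The real timekeeping core has locking and NTP
adjustment that this model ignores.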

Jeff

> Thanks,
>
> tglx
>

Well, that explains it.