Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups

From: Don Zickus
Date: Mon Jul 17 2017 - 10:46:44 EST


On Mon, Jul 17, 2017 at 01:24:23AM +0000, Liang, Kan wrote:
> Hi Don & Thomas,
>
> Sorry for the late response. We just finished the tests for all proposed patches.
>
> There are three proposed patches so far.
> Patch 1: The patch as above, which speeds up the hrtimer.
> Patch 2: Thomas's first proposal.
> https://patchwork.kernel.org/patch/9803033/
> https://patchwork.kernel.org/patch/9805903/
> Patch 3: My original proposal, which increases the NMI watchdog timeout by 3X.
> https://patchwork.kernel.org/patch/9802053/
>
> According to our tests, only patch 3 works well.
> The other two patches eventually hang the system.
> With patch 1, the system hung after running our test case for ~1 hour.
> With patch 2, the system hung during the overnight test.
> No error message was shown when the system hung, so I don't know the
> root cause yet.

Hi Kan,

Thanks for the feedback. Odd that the different patches had different
results. What is more odd to me is the hang. I thought these were all
false lockups that prematurely panicked and rebooted the box.

Is the machine configured to panic on hardlockup and reboot? Perhaps kdump
is enabled to store the console log for review upon reboot?

It almost implies that a hardlockup did happen but isn't being detected
until later?
>
> BTW: We set watchdog_thresh to 1 when we did the test.
> We believe that makes the failure show up sooner.

Sure, you/they look for 1 second hangs instead of 10 second ones. But with
patch 3 it is more like 3 seconds-ish vs 30 seconds-ish.
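
To spell out that arithmetic, here is a minimal sketch. It assumes only what
is stated in this thread: that the hard lockup window is roughly
watchdog_thresh seconds and that patch 3 stretches it by 3X. The helper name
and the multiplier constant are hypothetical and do not come from the kernel
source.

```python
# Illustrative arithmetic only; this does not model the real watchdog code.
# Assumption: patch 3 multiplies the NMI watchdog timeout by 3.
PATCH3_MULTIPLIER = 3


def hardlockup_window(watchdog_thresh, multiplier=1):
    """Approximate hard lockup detection window, in seconds."""
    return watchdog_thresh * multiplier


# Compare the test setting (1) with the 10 second case mentioned above.
for thresh in (1, 10):
    unpatched = hardlockup_window(thresh)
    patched = hardlockup_window(thresh, PATCH3_MULTIPLIER)
    print(f"watchdog_thresh={thresh}: ~{unpatched}s unpatched, "
          f"~{patched}s with patch 3")
```

So even with watchdog_thresh set to 1, a kernel carrying patch 3 would only
flag hangs lasting about 3 seconds, not 1.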

As Thomas asked, I would also be interested in the way the test works. The
hang doesn't make sense.

Cheers,
Don