Re: watchdog: print stolen time increment at softlockup detection

From: Marcelo Tosatti
Date: Wed Jul 03 2013 - 22:33:11 EST


On Wed, Jul 03, 2013 at 12:44:01PM -0400, Don Zickus wrote:
> On Fri, Jun 28, 2013 at 05:37:39PM -0300, Marcelo Tosatti wrote:
> > On Fri, Jun 28, 2013 at 10:12:15AM -0400, Don Zickus wrote:
> > > On Thu, Jun 27, 2013 at 11:57:23PM -0300, Marcelo Tosatti wrote:
> > > >
> > > > One possibility for a softlockup report in a Linux VM, is that the host
> > > > system is overcommitted to the point where the watchdog task is unable
> > > > to make progress (unable to touch the watchdog).
> > >
> > > I think I am confused on the VM/host stuff. How does an overcommitted
> > > host prevent a high priority task like the watchdog from running?
> > >
> > > Or is it the watchdog task on the VM that is being blocked from running
> > > because the host is overcommitted and can't run the VM frequently enough?
> >
> > Yes, that's the case.
> >
> > > The latter would make sense, though I thought you solved that with the
> > > other kvm splat in the watchdog code a while ago. So I would be
> > > interested in understanding why the previous solution isn't working.
> >
> > That functionality is for a notification so the guest ignores the time
> > jump induced by a vm pause. This problem is similar to the kgdb case.
> >
> > > Second, I am still curious how this problem differs from say kgdb or
> > > suspend-hibernate/resume. Don't both of those scenarios deal with a
> > > clock that suddenly jumps forward without the watchdog task running?
> >
> > The difference is this:
> >
> > The present functionality in watchdog.c allows the hypervisor to notify
> > the guest that it should ignore the large delta seen via clock reads
> > (at the watchdog timer interrupt).
> > This notification is used for the case where the vm has been paused for
> > a period of time.
>
> But why do this at the watchdog timer interrupt? I thought this would be
> done at the lower layer like in sched_clock() or something.
>
> >
> > Are you suggesting the host should silence the guest watchdog, also in
> > the overcommitment case? Issues I see with that:
> >
> > 1) The host is not aware of the variable softlockup threshold in
> > the guest.
> >
> > 2) Whatever the threshold of overcommitment for sending the ignore
> > softlockup notification to the guest, genuine softlockup detections in
> > the guest could be silenced, given proper conditioning.
>
> No. That would be difficult as you described. What I am trying to get at
> is, doesn't the guest /know/ time jumped when it schedules again? And
> can't it determine based on this jump that something unreasonable
> happened like a long pause or an overcommit?

A large jump alone is not enough information to reset the watchdog(s).

For example, consider this large jump scenario:

1. guest instruction exits to host for emulation.
2. emulation completes after 10 minutes, resumes execution at
next instruction.
3. watchdog detects jump and prints a warning.

If the jump is due to inefficient or incorrect emulation, the message
should be printed.
If the jump is due to a vm pause, the message should not be printed.
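
To illustrate the distinction, here is a rough user-space sketch (not
kernel code). The struct, the enum and wd_classify() are hypothetical
names made up for this example, and the pause_notified flag stands in
for whatever notification the host delivers after a vm pause. The point
is that the clock delta looks the same in both cases, so only the
notification can tell them apart:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU watchdog state; names are illustrative only. */
struct wd_state {
	uint64_t last_touch_ns;  /* last time the watchdog task made progress */
	bool pause_notified;     /* assumed: set by the host after a vm pause */
};

enum wd_verdict { WD_OK, WD_IGNORE_PAUSE, WD_REPORT };

/*
 * Classify a clock delta. The size of the jump is the same in the pause
 * and slow-emulation cases; only the notification tells them apart.
 * Assumes now_ns never goes backwards relative to last_touch_ns.
 */
static enum wd_verdict wd_classify(struct wd_state *s, uint64_t now_ns,
				   uint64_t thresh_ns)
{
	if (now_ns - s->last_touch_ns <= thresh_ns)
		return WD_OK;

	if (s->pause_notified) {
		/* vm pause: consume the notification and restart the window */
		s->pause_notified = false;
		s->last_touch_ns = now_ns;
		return WD_IGNORE_PAUSE;
	}

	/* no notification: slow emulation, overcommit, or a genuine lockup */
	return WD_REPORT;
}
```

With a 20 second threshold, a 10 minute jump without the flag yields
WD_REPORT, and the same jump with the flag set yields WD_IGNORE_PAUSE.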

> > And why is overcommitment not a valid reason to generate a softlockup in
> > the first place?
>
> For the guest I don't believe it is. It isn't the guest's fault it
> couldn't run processes. A warning should be scheduled on the host that it
> couldn't run a process in a very long time.
>
> > > For some reason I had the impression that when a VM starts running again,
> > > one of the first things it does is sync up its clock again (which leads to
> > > a softlockup shortly thereafter in the case of paused/overcommitted VMs)?
> >
> > Sort of, the kvmclock counts while the VM is running (whether it is
> > overcommitted or not).
>
> Does comparing the kvmclock with the current clock indicate that a long
> pause or an overcommit occurred?

By current clock, do you mean the system clock? sched_clock() reads from
kvmclock.

> > > At that time I would have thought that the code could detect a large jump
> > > in time and touch_softlockup_watchdog_sync() or something to delay the
> > > check until the next cycle.
> >
> > But this would silence any softlockups that are due to delays
> > in the host preventing the watchdog task from making progress (eg:
> > https://lkml.org/lkml/2013/6/20/633, in that case if 1 operation took
> > longer than expected your suggestion would silence the report).
>
> Ok. I don't fully understand that problem, the changelog was a little
> vague.

That problem is described in the large jump scenario with guest
instruction exiting for emulation (in the beginning of this message).

> > > That would make the watchdog code a lot less messy than having custom
> > > kvm/paravirt splat all over it. Generic solutions are always nice. :-)
> >
> > Can you give more detail on what the suggestion is and how can you deal
> > with points 1 and 2 above?
>
> I don't have a good suggestion, just a lot of questions really. The thing
> is there are lots of watchdogs in the system (ie clock watchdog,
> filesystem watchdog, rcu stalls, etc). Solving this problem just for the lockup
> watchdog doesn't seem right because if the lockup timeout was longer, you
> would probably hit the other watchdogs too.

Agreed. However, I can't see a way around "having custom
kvm/paravirt splat all over" for watchdogs that do:

1. check for watchdog resets
2. read time via sched_clock or xtime.
3. based on 2, decide whether there has been a longer delay than
acceptable.

This is the case for the softlockup timer interrupt. So the splat there
is necessary (otherwise any potential notification of a vm-pause event
noticed at 2 might be missed because it's checked at 1).
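
The ordering problem can be sketched in user space as follows. This is
a simplified model, not the watchdog.c code: struct guest, read_clock()
and both lockup_*() functions are hypothetical, and the model assumes
that a pending vm pause only becomes visible (and requests a watchdog
reset) at the moment the clock is read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest model; none of these names are kernel API. */
struct guest {
	uint64_t clock_ns;       /* value returned by the next clock read */
	uint64_t last_touch_ns;  /* last watchdog touch */
	bool touched;            /* a watchdog reset has been requested */
};

/*
 * Assumed behaviour: a pending vm pause becomes visible only when the
 * clock is read, and the read requests a watchdog reset.
 */
static uint64_t read_clock(struct guest *g, uint64_t paused_ns)
{
	if (paused_ns) {
		g->clock_ns += paused_ns;
		g->touched = true;
	}
	return g->clock_ns;
}

/* Steps 1, 2, 3 as listed: the reset raised by step 2 arrives too late. */
static bool lockup_generic(struct guest *g, uint64_t paused_ns,
			   uint64_t thresh_ns)
{
	bool reset = g->touched;                  /* 1. check for resets */
	uint64_t now = read_clock(g, paused_ns);  /* 2. read time */

	if (reset) {
		g->touched = false;
		g->last_touch_ns = now;
		return false;
	}
	return now - g->last_touch_ns > thresh_ns;  /* 3. decide */
}

/* Same steps, plus an explicit re-check of the reset after the read. */
static bool lockup_with_recheck(struct guest *g, uint64_t paused_ns,
				uint64_t thresh_ns)
{
	uint64_t now = read_clock(g, paused_ns);

	if (g->touched) {
		g->touched = false;
		g->last_touch_ns = now;
		return false;
	}
	return now - g->last_touch_ns > thresh_ns;
}
```

In this model a 60 second pause with a 20 second threshold makes
lockup_generic() report a lockup, while lockup_with_recheck() does not.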

For watchdogs that measure time based on interrupt events (such as hung
task, rcu_cpu_stall), checking for the notification at sched_clock or
lower is fine.

> So my suggestion (based on my ignorance of how the clock code works) is
> that some sort of generic mechanism be applied to all the watchdogs. Much
> like how kgdb touches all of them at once when it handles an exception.
>
> For example, unpausing a guest could be a good time to touch all the
> watchdogs as you have no idea how long the pause was. I can't think of
> any hook for an overcommit though.

It's a good suggestion - will write a patch to touch watchdogs at read
of kvmclock.
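
Roughly what I have in mind, as a user-space sketch only: the hook
table, register_touch_hook(), guest_clock_read() and the
guest_was_paused flag are all hypothetical names for illustration, not
existing kernel interfaces, and the two example hooks merely stand in
for the softlockup and RCU stall watchdogs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TOUCH_HOOKS 8

typedef void (*touch_fn)(void);

/* Hypothetical registry of "touch me" callbacks, one per watchdog. */
static touch_fn touch_hooks[MAX_TOUCH_HOOKS];
static size_t nr_touch_hooks;

/* Assumed: set on behalf of the host when the vm is unpaused. */
static bool guest_was_paused;

static bool register_touch_hook(touch_fn fn)
{
	if (nr_touch_hooks == MAX_TOUCH_HOOKS)
		return false;
	touch_hooks[nr_touch_hooks++] = fn;
	return true;
}

/*
 * Hypothetical clock read path: if a pause was flagged, consume the
 * flag and touch every registered watchdog before returning the time.
 */
static uint64_t guest_clock_read(uint64_t raw_ns)
{
	if (guest_was_paused) {
		guest_was_paused = false;
		for (size_t i = 0; i < nr_touch_hooks; i++)
			touch_hooks[i]();
	}
	return raw_ns;
}

/* Example hooks standing in for the softlockup and RCU stall watchdogs. */
static unsigned int softlockup_touches, rcu_stall_touches;
static void touch_softlockup_example(void) { softlockup_touches++; }
static void touch_rcu_stall_example(void) { rcu_stall_touches++; }
```

This only covers the vm pause case. It does nothing for overcommit,
where there is no flag to consume.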

Thanks!
