Re: [Query] Preemption (hogging) of the work handler

From: Viresh Kumar
Date: Tue Jul 12 2016 - 20:18:51 EST


On 12-07-16, 16:19, Viresh Kumar wrote:
> Okay, we have tracked down this BUG and it's really interesting.
>
> I hacked the platform's serial driver to implement a putchar() routine
> that simply writes to the FIFO in polling mode, which helped us trace
> where we are going wrong.
>
> The problem is that we are running asynchronous printks and we call
> wake_up_process() from the last running CPU which has disabled
> interrupts. That takes us to: try_to_wake_up().
>
> In our case the CPU gets deadlocked on this line in try_to_wake_up().
>
> raw_spin_lock_irqsave(&p->pi_lock, flags);
>
> I will explain how:
>
> The try_to_wake_up() function takes us through the scheduler code (RT
> sched), to the hrtimer code, where we eventually call ktime_get() (for
> the MONOTONIC clock used for hrtimer). And this function has this:
>
> WARN_ON(timekeeping_suspended);
>
> This starts another printk while we are in the middle of
> wake_up_process() and the CPU tries to take the above lock again and
> gets stuck there :)
>
> This doesn't happen every time because we don't always call ktime_get()
> and it is called only if hrtimer_active() returns false.
>
> This happened because of a WARN_ON() but it can happen anyway. Think
> about this case:
>
> - offline all CPUs, except 0
> - call any routine that prints messages after disabling interrupts,
> etc.
> - If any of the functions within wake_up_process() does a print, we are
> screwed.
>
> So the thing is that we can't really call wake_up_process() in cases
> where the last CPU disables interrupts. And that's why my fixup patch
> (which moved to synchronous prints after suspend) really works.

Actually, any printk done from wake_up_process() will hit this, even
if all the other CPUs are up as well :)

It's only BUG_ON() which has special handling in printk, and so we
print that safely.
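
To make the recursion concrete, here is a minimal userspace sketch in C
of the path described above. It is only a toy model, not kernel code:
all the toy_* names are made up for illustration, the lock is a plain
flag standing in for p->pi_lock, and a failed acquire is reported
instead of spinning forever the way the real CPU does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a non-recursive raw spinlock (p->pi_lock). */
struct toy_lock {
	bool held;
};

static struct toy_lock pi_lock;
static bool timekeeping_suspended;	/* models the kernel flag of that name */
static bool deadlocked;			/* set where the real CPU would hang */

/* Returns false where a real spinlock would spin on itself forever. */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

static int toy_wake_up_process(void);

/* Asynchronous printk: queue the message, then wake the printing thread. */
static void toy_printk(const char *msg)
{
	(void)msg;			/* the message itself does not matter here */
	toy_wake_up_process();
}

/* Models ktime_get() and its WARN_ON(timekeeping_suspended). */
static void toy_ktime_get(void)
{
	if (timekeeping_suspended)
		toy_printk("WARN_ON(timekeeping_suspended)");
}

/* Models wake_up_process() -> try_to_wake_up(). */
static int toy_wake_up_process(void)
{
	if (!toy_lock_acquire(&pi_lock)) {
		deadlocked = true;	/* real kernel: stuck on pi_lock here */
		return -1;
	}

	toy_ktime_get();		/* RT sched -> hrtimer -> ktime_get() */

	toy_lock_release(&pi_lock);
	return deadlocked ? -1 : 0;
}

/* Reset the model and run one wakeup; 0 is fine, -1 is the self-deadlock. */
int toy_run(bool suspended)
{
	pi_lock.held = false;
	deadlocked = false;
	timekeeping_suspended = suspended;
	return toy_wake_up_process();
}
```

In this model toy_run(false) returns 0, while toy_run(true) returns -1
because the nested wakeup finds the lock already held by the same
caller. Any other print issued while the lock is held would take the
same route; the WARN_ON() is just the one we happened to hit.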

--
viresh