Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS back to DEFAULT

From: Andy Lutomirski
Date: Thu Aug 14 2014 - 21:27:58 EST


On Thu, Aug 14, 2014 at 6:21 PM, Liu, Chuansheng
<chuansheng.liu@xxxxxxxxx> wrote:
>
>
>> -----Original Message-----
>> From: Andy Lutomirski [mailto:luto@xxxxxxxxxxxxxx]
>> Sent: Friday, August 15, 2014 5:23 AM
>> To: Peter Zijlstra
>> Cc: Daniel Lezcano; Liu, Chuansheng; Rafael J. Wysocki;
>> linux-pm@xxxxxxxxxxxxxxx; LKML; Liu, Changcheng; Wang, Xiaoming;
>> Chakravarty, Souvik K
>> Subject: Re: [PATCH] cpuidle: Fix the CPU stuck at C0 for 2-3s after PM_QOS
>> back to DEFAULT
>>
>> On Thu, Aug 14, 2014 at 2:16 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> wrote:
>> > On Thu, Aug 14, 2014 at 02:12:27PM -0700, Andy Lutomirski wrote:
>> >> On 08/14/2014 04:14 AM, Daniel Lezcano wrote:
>> >> > On 08/14/2014 01:00 PM, Peter Zijlstra wrote:
>> >> >>
>> >> >> So seeing how you're from @intel.com I'm assuming you're using x86
>> here.
>> >> >>
>> >> >> I'm not seeing how this can be possible, MWAIT is interrupted by IPIs
>> >> >> just fine, which means we'll fall out of the cpuidle_enter(), which
>> >> >> means we'll cpuidle_reflect(), and then leave cpuidle_idle_call().
>> >> >>
>> >> >> It will indeed not leave the cpu_idle_loop() function and go right back
>> >> >> into cpuidle_idle_call(), but that will then call cpuidle_select() which
>> >> >> should pick a new C state.
>> >> >>
>> >> >> So the interrupt _should_ work. If it doesn't you need to explain why.
>> >> >
>> >> > I think the issue is related to the poll_idle state, in
>> >> > drivers/cpuidle/driver.c. This state is x86 specific and inserted in the
>> >> > cpuidle table as the state 0 (POLL). There is no mwait for this state.
>> >> > It is a bit confusing because this state is not listed in the acpi /
>> >> > intel idle driver but inserted implicitly at the beginning of the idle
>> >> > table by the cpuidle framework when the driver is registered.
>> >> >
>> >> > static int poll_idle(struct cpuidle_device *dev,
>> >> > struct cpuidle_driver *drv, int index)
>> >> > {
>> >> > local_irq_enable();
>> >> > if (!current_set_polling_and_test()) {
>> >> > while (!need_resched())
>> >> > cpu_relax();
>> >> > }
>> >> > current_clr_polling();
>> >> >
>> >> > return index;
>> >> > }
>> >>
>> >> As the most recent person to have modified this function, and as an
>> >> avowed hater of pointless IPIs, let me ask a rather different question:
>> >> why are you sending IPIs at all? As of Linux 3.16, poll_idle actually
>> >> supports the polling idle interface :)
>> >>
>> >> Can't you just do:
>> >>
>> >> if (set_nr_if_polling(rq->idle)) {
>> >> trace_sched_wake_idle_without_ipi(cpu);
>> >> } else {
>> >> spin_lock_irqsave(&rq->lock, flags);
>> >> if (rq->curr == rq->idle)
>> >> smp_send_reschedule(cpu);
>> >> // else the CPU wasn't idle; nothing to do
>> >> raw_spin_unlock_irqrestore(&rq->lock, flags);
>> >> }
>> >>
>> >> In the common case (wake from C0, i.e. polling idle), this will skip the
>> >> IPI entirely unless you race with idle entry/exit, saving a few more
>> >> precious electrons and all of the latency involved in poking the APIC
>> >> registers.
>> >
>> > They could and they probably should, but that logic should _not_ live in
>> > the cpuidle driver.
>>
>> Sure. My point is that fixing the IPI handler is, I think, totally
>> bogus, because the IPI API isn't the right way to do this at all.
>>
>> It would be straightforward to add a new function wake_if_idle(int
>> cpu) to sched/core.c.
>>
> Thanks Andy and Peter's suggestion, it will save some IPI things in case the cores are not
> in idle.

This isn't quite right. Using the polling interface correctly will
save IPIs in case the core *is* idle. But, given that you are trying
to upgrade the chosen idle state, I don't think you need to kick
non-idle CPUs at all, and my example contains that optimization.

Presumably the function should be named something like wake_up_if_idle.

>
> There is one similar API in sched/core.c wake_up_idle_cpu(),
> then just need add one new common smp API:
>
> smp_wake_up_cpus() {
> for_each_online_cpu()
> wake_up_idle_cpu();
> }
>
> Will try one patch for it.

This will have lots of extra overhead if the cpu is *not* idle. I
think my example will be a lot more efficient.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/