Re: [PATCH] a patch to fix the cpu-offline-online problem caused by pm_idle

From: Luming Yu
Date: Sun Jan 30 2011 - 22:26:12 EST


On Mon, Jan 31, 2011 at 12:36 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Sat, 2011-01-29 at 13:44 +0800, Luming Yu wrote:
>> On Fri, Jan 28, 2011 at 6:30 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> >> We have seen an extremely slow system under the CPU-OFFLINE-ONLINE test
>> >> on a 4-socket NHM-EX system.
>> >
>> > Slow is OK, cpu-hotplug isn't performance critical by any means.
>>
>> Here is one example where the "slow" is not acceptable. Maybe I should
>> not have used "slow" in the first place. It happens after I resolved a
>> similar NMI watchdog warning in calibrate_delay_direct..
>>
>> Please note, I got the BUG in a 2.6.32-based kernel. Upstream behaves
>> similarly, I guess.
>
> Guessing is totally the wrong thing when you're sending stuff upstream,
> esp ugly patches such as this. .32 is more than a year old, anything
> could have happened.

OK, the default upstream kernel seems to have the NMI watchdog disabled,
since the NMI count is 0:

# cat /proc/interrupts | grep -i nmi
NMI:  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0 0  0 0 0 0 0 0 0   Non-maskable interrupts
[root@intel-s3e36-02 ~]# uptime
21:43:27 up 34 min, 2 users, load average: 0.40, 1.27, 2.32
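
FWIW, something like the trivial helper below can be used to keep an eye
on the NMI counters; it just prints the NMI line of /proc/interrupts and
the sum of its per-CPU counts (illustration only, nothing to do with the
patch itself):

/* Print the NMI line of /proc/interrupts and the sum of its per-CPU
 * counters.  Plain userspace helper, illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char line[4096];
	unsigned long long total, v;
	char *p, *end;
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f) {
		perror("/proc/interrupts");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "NMI:", 4) != 0)
			continue;
		total = 0;
		/* sum the per-CPU counters that follow "NMI:" */
		for (p = line + 4; ; p = end) {
			v = strtoull(p, &end, 10);
			if (end == p)	/* reached the "Non-maskable interrupts" label */
				break;
			total += v;
		}
		printf("%stotal NMIs so far: %llu\n", line, total);
	}
	fclose(f);
	return 0;
}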

But there is a problem like this:

Booting Node 3 Processor 59 APIC 0x75
Clocksource tsc unstable (delta = -77316000544 ns)
Switching to clocksource hpet


>
>> BUG: soft lockup - CPU#63 stuck for 61s! [migration/63:256]
>
>> > If it's slow but working, the test is broken, I don't see a reason to do
>> > anything to the kernel, let alone the below.
>>
>> It's not working sometimes, so I think it's not a solid feature right now.
>
> But you didn't say anything about not working, you merely said slow. If
> it's not working, you need to very carefully explain what is not working,
> where it's deadlocked, how your patch solves this, and how you avoid
> wrecking stuff for everybody else.

It's not working because of the NMI watchdog. If you ignore the NMI
watchdog, then I guess it works, just slowly.

>
>> >> since it currently and unnecessarily interacts implicitly with
>> >> CPU power management.
>> >
>> > daft statement at best, because if not for some misguided power
>> > management purpose, what are you actually unplugging cpus for?
>> > (misguided because unplug doesn't actually save more power than simply
>> > idling the cpu).
>> It's a RAS feature, and suspend/resume also hits the same code path, I think.
>
> That still doesn't say anything, also who in his right mind suspends a
> nhm-ex system?

But we need to have solid code in place. We can't blame a user who
finds something useful in trying that. Letting the hotplug code
implicitly interact with CPU PM complicates things unnecessarily.
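
For context, the offline/online test itself is nothing exotic; a
simplified userspace version of it would look roughly like the following
(just cycling CPUs through sysfs; this is an illustration, not our
actual test harness):

/* Cycle CPUs offline/online through sysfs.  Needs root and
 * CONFIG_HOTPLUG_CPU; simplified illustration only. */
#include <stdio.h>

static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;
	int ret = 0;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such CPU, or it is not hotpluggable */
	if (fprintf(f, "%d\n", online) < 0)
		ret = -1;
	if (fclose(f) != 0)	/* the write is flushed here */
		ret = -1;
	return ret;
}

int main(void)
{
	int cpu, iter;

	/* cycle CPUs 1..63 a few times; CPU 0 typically cannot be
	 * offlined on x86, so it is skipped */
	for (iter = 0; iter < 10; iter++) {
		for (cpu = 1; cpu < 64; cpu++) {
			if (set_cpu_online(cpu, 0))
				continue;	/* offline failed, move on */
			set_cpu_online(cpu, 1);	/* bring it back up */
		}
	}
	return 0;
}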

>
>> > So you flip the pm_idle pointer protected under the hotplug mutex, but
>> > that's not serialized against module loading, so what happens if you
>> > concurrently load a module that sets another idle policy?
>> >
>> > Your changelog is vague at best, so what exactly is the purpose here? We
>> > flip to default_idle(), which uses HLT, which is C1. Then you run
>> > cpu_idle_wait(), which will IPI all cpus, all these CPUs (except one)
>> > could have been in deep C states (C3+) so you get your slow wakeup
>> > anyway.
>> >
>> > Thereafter you do the normal stop-machine hotplug dance, which again
>> > will IPI all cpus once, then you flip it back to the saved pm_idle
>> > handler and again IPI all cpus.
>>
>> https://lkml.org/lkml/2009/6/29/60
>> Sending one IPI takes 50-100us of latency, so you can get an idea of the
>> cost on a large NHM-EX system with 64 logical processors. With
>> tickless (NOHZ) and the APIC timer stopped in C3 on NHM-EX, you can also
>> get an idea of the problem I have.
>
> Ok, so one IPI costs 50-100 us; even with 64 cpus, that's at most 6.4 ms,
> nowhere near enough to trigger the NMI watchdog. So what does go wrong?

Good question!
But we also can't forget that there is a large wakeup latency from C3.
And I guess some reschedule ticks that should kick CPUs out of idle get
lost due to side effects of the CPU PM feature; with nohz=off,
everything seems to just work.
Yes, I agree we need to dig into that as well.
But it's a kind of combination problem between the special stop_machine
context and CPU power management...
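
To spell out the idea in the patch, it is roughly the following
(a simplified sketch, not the exact code; it assumes the old x86
pm_idle / default_idle() / cpu_idle_wait() interfaces as they exist
in .32-era kernels):

/*
 * Simplified sketch, not the exact patch: pin every CPU to the plain
 * HLT idle routine (C1) around the hotplug operation, then restore
 * whatever idle routine PM had installed.  The real switch is done
 * under the hotplug mutex.  The declarations below are as on
 * 2.6.32-era x86 (normally pulled in via asm/system.h, linux/pm.h).
 */
extern void (*pm_idle)(void);
extern void default_idle(void);
extern void cpu_idle_wait(void);

static void (*saved_idle)(void);

static void force_default_idle_for_hotplug(void)
{
	saved_idle = pm_idle;	/* remember the PM-selected idle routine */
	pm_idle = default_idle;	/* HLT only, no deep C-states */
	cpu_idle_wait();	/* IPI all CPUs so they pick up the new routine */
}

static void restore_idle_after_hotplug(void)
{
	pm_idle = saved_idle;	/* back to the PM idle routine */
	cpu_idle_wait();	/* and IPI everyone again */
}

That is also where the two extra rounds of IPIs you mention come from.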

>
> Why does your patch solve things? Like I said, it doesn't avoid the slow
> IPI at all; you still IPI each cpu right after changing the pm_idle
> function. Those IPIs will still hit C3+ states.
>
>> Let me know if there are still questions.
>
> Yeah, what are you smoking? Why do you wreck perfectly fine code for one
> backward ass piece of hardware.

Just to make things less complex...

Thanks
Luming
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/