Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

From: Ingo Molnar
Date: Sun May 17 2015 - 01:27:15 EST



* Len Brown <lenb@xxxxxxxxxx> wrote:

> On Thu, May 14, 2015 at 1:57 PM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> > * "Jan H. Schönherr" <jschoenh@xxxxxxxxx> wrote:
> >
> >> Ingo, do you want an updated version of the original patch, which
> >> takes care not to get stuck when the INIT deassertion is skipped, or
> >> do you prefer to address delays "one by one" as you wrote elsewhere?
> >
> > So I'm not against improving this code at all, but instead of this
> > hard to follow mixing of old and new code, I'd find the following
> > approach cleaner and more acceptable: create a 'modern' and a 'legacy'
> > SMP-bootup variant function, and do a clean separation based on the
> > CPU model cutoff condition used by Len's patches:
> >
> > /* if modern processor, use no delay */
> > if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
> > ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF)))
> > init_udelay = 0;
> >
> > Then in the modern variant we can become even more aggressive and
> > remove these kinds of delays as well:
>
> Not sure it is worth two versions, since this is not where the big
> time is spent.

There isn't much low-hanging fruit left, so reducing boot time is
going to be 'lots of hard work', and part of that work is making the
SMP bringup path easier to review and tweak.

That said, we could pass in a delay scaling parameter, which could be
zero for modern hardware. So something like:

udelay(100*scale);

would automatically turn into a zero delay on modern hardware; on old
hardware the scale factor could be 1.
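
I.e. something like this (just a sketch: 'smp_delay_scale' is a name
I'm making up here, the cutoff condition is Len's, quoted above):

	/* Sketch: delay scale factor is 1 on legacy, 0 on modern CPUs: */
	static unsigned int smp_delay_scale = 1;

	static void __init smp_set_delay_scale(void)
	{
		/* Same cutoff condition as in Len's patch: */
		if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
		     (boot_cpu_data.x86 == 6)) ||
		    ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
		     (boot_cpu_data.x86 >= 0xF)))
			smp_delay_scale = 0;
	}

after which every fixed delay in the bringup path can be written as
'udelay(100 * smp_delay_scale)' and compiles down to no delay at all
on modern hardware.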

OTOH there are other legacies here as well, so maybe a split version
would still be the most robust approach, so that we can have a
completely separate legacy path we don't change all that often.

Then again, that would also mean 'we don't test it all that often',
which could result in more breakages than sharing it with the modern
bootup code.

So I'm a bit torn about that.

> See below.
>
> >
> > udelay(300);
>
> FWIW, MPS 1.4 suggests this should be 200, not 300.
>
> > udelay(200);
> >
> > plus I'd suggest making these poll loops in smpboot.c loops narrower:
> >
> > udelay(100);
>
> FWIW, on my desktop, this one executed 17 times (1700 usec).
> This is the time for the remote CPU to wake and get to cpu_init().
> Why is it a benefit to have any udelay() before invoking schedule()?

See further below.

>
> > udelay(100);
>
> This one didn't execute at all. Indeed, I don't understand why it
> exists, per the question above.
>
> /*
>  * Wait till AP completes initial initialization
>  */
> while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
> 	/*
> 	 * Allow other tasks to run while we wait for the
> 	 * AP to come online. This also gives a chance
> 	 * for the MTRR work (triggered by the AP coming online)
> 	 * to be completed in the stop machine context.
> 	 */
> 	udelay(100);
> 	schedule();
> }

So this is essentially a poor man's sched_yield().

I'd really love to see this moved to a
wait_for_completion_timeout()/complete() loop instead. For that it
would have to move out from under the irqs-off section on the boot CPU
(so that timer irqs can come in), but that should be OK: the fragile
part is on the AP CPU, not on the BP, I think.
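
I.e. roughly the following (a sketch only: 'cpu_callin_done' is a
made-up name, and a real version would need a reinit_completion()
per bringup):

	static DECLARE_COMPLETION(cpu_callin_done);

	/* AP side, at the point where it currently sets cpu_callin_mask: */
	complete(&cpu_callin_done);

	/* BP side, replacing the udelay(100)+schedule() poll: */
	if (!wait_for_completion_timeout(&cpu_callin_done,
					 msecs_to_jiffies(10000)))
		pr_err("CPU%d failed to call in\n", cpu);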

That would allow us, on truly modern hardware, to implement parallel
bringup of all 120 CPUs, i.e. to first do a loop of triggering bootup
on all the secondary CPUs, then collect them in a completion loop.
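
In rough pseudo-code, with both helpers below being hypothetical:

	/* Phase #1: trigger bootup on all secondary CPUs: */
	for_each_present_cpu(cpu) {
		if (cpu == smp_processor_id())
			continue;
		trigger_cpu_bootup(cpu);	/* INIT/SIPI sequence */
	}

	/* Phase #2: collect them via completion waits: */
	for_each_present_cpu(cpu) {
		if (cpu == smp_processor_id())
			continue;
		wait_for_cpu_callin(cpu);	/* completion-based wait */
	}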

I'd add an easy debug boot option as well to serialize it all again,
in case something breaks.

This would help parallelize much of the initialization overhead in the
secondary CPUs:

> So, the latest TIP has the INIT udelay(10,000) removed,
> but cpu_up() still takes nearly 19,000 usec on a HSW desktop.
>
> A quick scan of the ftrace shows some high runners:
>
> 18949.45 us cpu_up()
>   2450.580 us notifier_call_chain
>     102.751 us thermal_throttle_cpu_callback
>     289.313 us dpm_sysfs_add
>     1019.594 us msr_class_cpu_callback
>   ...
>   8455.462 us native_cpu_up()
>     500.000 us = udelay(300) + udelay(200) Startup IPI
>     500.000 us = udelay(300) + udelay(200) Startup IPI
>     1700.000 us = 17 x udelay(100) waiting for AP in initialized_map
>     2004.172 us check_tsc_warp()
>
>   7977.799 us cpu_notify()
>     1588.108 us cpuset_cpu_active
>     3043.955 us cacheinfo_cpu_callback
>     1146.234 us mce_cpu_callback
>     541.105 us cpufreq_cpu_callback
>     213.685 us coretemp_cpu_callback
>
>
> cacheinfo_cpu_callback() time appears to be spent creating a bunch
> of sysfs nodes, which is apparently an expensive operation.
>
> check_tsc_warp() is hard-coded to take 2ms. I don't know if 2ms is a
> magic number or if a shorter delay would work just as well. It seems a bit sad to do
> this serially for every CPU at boot, when we could do all the CPUs
> in parallel after they are on-line. Perhaps this should be invoked
> only for boot-time and hot-add time. It shouldn't be needed at all
> for soft online and resume.

So how come the TSC isn't X86_FEATURE_CONSTANT_TSC? That should skip
the TSC sync-test.
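
I.e. I'd expect a check along these lines to cut the 2ms per CPU (a
sketch, placed in check_tsc_sync_source() just for illustration):

	static void check_tsc_sync_source(int cpu)
	{
		/* Skip the ~2ms warp test if the TSC is known-constant: */
		if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
			return;

		...
	}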

> Startup IPI delays.
> MPS 1.4 actually says 200+200, not the 300+200 that Linux uses.
> I don't know where the 300 came from, maybe it was a typo?

No, I think it was simple paranoia - our udelay() used to be imprecise
(and still can be).

I'd not touch the delays for legacies, I'd create a 'modern' bootup
path with aggressive optimizations, that new hardware should try hard
to hit. If it causes problems on some hw we can quirk those out. If
the quirks look too widespread, we can make things less aggressive.

> msr_class_cpu_callback -- making device nodes is not fast.
>
> I don't know if anything can be done for the 1700us wait
> for the remote processor to mark itself initialized.
> That is the first thing it does when it enters cpu_init().

So that 1.7 msecs delay is essentially the firmware?

Thanks,

Ingo