Re: Reduce Linux boot time on Large scale system

From: Thomas Gleixner
Date: Wed Apr 19 2017 - 04:58:35 EST


On Wed, 19 Apr 2017, Peter Zijlstra wrote:
> On Tue, Apr 04, 2017 at 04:39:06PM +0000, Noam Camus wrote:
> > Hi Peter & Vineet
> >
> > I wish to reduce boot time of my platform ARC/plat-eznps (4K CPUs).
> > My analysis is that most boot time is spent over cpu_up() for all CPUs
> > Measurements are about 66mS per CPU and Totally over 4 minutes (I got 800MHz cores).
> >
> > I see that smp_init() just iterate over all present cpus one by one.
> > I wish to know if there was an attempt to optimize this with some parallel work?
> >
> > Are you aware of some method / trick that will help me to reduce boot time?
> > Any suggestion how this can be done?
>
> So attempts have been made in the past but Thomas shot them down for
> being gross hacks (they were).
>
> But Thomas has now (mostly) completed rewriting the CPU hotplug
> machinery and he has at some point outlined means of achieving what
> you're after.
>
> I've added him to Cc so he can correct me where I'm wrong, as I've not
> looked into this in much detail after he mucked up all I knew about CPU
> hotplug.
>
> Since each CPU is now responsible for its own bootstrap, we can now kick
> all the CPUs awake without waiting for them to complete the online
> stage.
>
> There might however be code that assumes CPUs come up one at a time, so
> you'll need to audit for that. It's not going to be a trivial thing.

There are a couple of things to consider.

First of all we should make the whole 'kick CPU into life' and surrounding
magic generic. Every arch has its own handshake mechanism.

That would look like this:

Step    BP                                      AP
0-9     [preparatory steps]

10      [kick cpu into life (arch callback)]
11                                              [Do initial arch bringup then
                                                 call into a generic function]
12      [handshake (generic)]                   [handshake (generic)]
13      [more arch specific magic]              [more arch specific magic]

14-20                                           [CPU starting]
                                                [CPU goes online]

40      [CPU active, hotplug done]

So the first step in parallelizing this would be:

	for_each_present_cpu(cpu)
		cpu_up(target_state = 10);

i.e. make the allocations and whatever preparatory work needs to be done
and kick the CPU into life. The target CPU would initialize the low level
stuff and then call into a generic function, which does the generic
initialization and then waits for the handshake.

So the next thing would be:

	for_each_present_cpu(cpu)
		cpu_up(target_state = 40);

This last step has to be single threaded for now because almost all
facilities which use CPU hotplug rely on the current serialization. There
are also code paths which use get_online_cpus() or cpu_hotplug_disable() to
prevent interaction with cpu hotplug.
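
For illustration only, the two passes can be modelled in plain userspace C.
This is a toy sketch, not kernel code: toy_cpu_up(), toy_smp_init() and the
TOY_* names are made up here, and only the step numbers 10 and 40 come from
the table above. It models the per-CPU state bookkeeping, not the actual
concurrency; in the real thing the APs would be doing their low level
bringup in parallel while the first loop is still kicking the remaining
CPUs.

```c
#include <assert.h>

#define NR_TOY_CPUS 4

/* Step numbers follow the table in this mail; the names are hypothetical. */
enum toy_step {
	TOY_OFFLINE = 0,
	TOY_KICKED  = 10,	/* AP has been kicked into life */
	TOY_ACTIVE  = 40,	/* CPU active, hotplug done */
};

struct toy_cpu {
	int state;		/* last completed step */
	int steps_run;		/* number of steps executed so far */
};

static struct toy_cpu toy_cpus[NR_TOY_CPUS];

/*
 * Advance one CPU step by step until it has reached 'target'.
 * A CPU which is already at or beyond 'target' is left alone.
 * Returns the state the CPU ended up in.
 */
static int toy_cpu_up(int cpu, int target)
{
	struct toy_cpu *c;

	assert(cpu >= 0 && cpu < NR_TOY_CPUS);
	c = &toy_cpus[cpu];

	while (c->state < target) {
		c->state++;
		c->steps_run++;
	}
	return c->state;
}

static void toy_smp_init(void)
{
	int cpu;

	/* Pass 1: preparatory steps plus the kick, for every CPU. */
	for (cpu = 0; cpu < NR_TOY_CPUS; cpu++)
		toy_cpu_up(cpu, TOY_KICKED);

	/* Pass 2: serialized handshake and online steps, one CPU at a time. */
	for (cpu = 0; cpu < NR_TOY_CPUS; cpu++)
		toy_cpu_up(cpu, TOY_ACTIVE);
}
```

The point of the split is that the second pass resumes from wherever the
first one stopped, so no step runs twice for any CPU.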

The hotplug machinery is already designed so that after the handshake
(#12/13) a plugged CPU can bring itself up completely alone, but due to the
serialization expectations all over the place this won't work today.

To make it work, you have to go through every single user of CPU hotplug
callbacks and every single site which prevents hotplug via
get_online_cpus() or cpu_hotplug_disable(), audit them for concurrency
issues and fix them up.

There might also be interaction required with the state machine, i.e.
stopping the state progress of a self-plugging CPU between two steps to
make serialization work.
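
A rough userspace sketch of such a stop point, under stated assumptions:
struct gated_cpu, gated_advance() and gate_release() are invented for this
example, and parking at step 20 is an arbitrary choice, not something the
hotplug core defines. A real implementation would block the AP at the gate
instead of returning to the caller.

```c
#include <assert.h>
#include <stdbool.h>

#define GATE_STEP	20	/* hypothetical stop point between two steps */
#define GATE_ACTIVE	40	/* CPU active, hotplug done */

struct gated_cpu {
	int state;		/* last completed step */
	bool gate_open;		/* set by the controlling CPU */
};

/*
 * Run on behalf of the self-plugging CPU: advance towards 'target', but
 * stop at GATE_STEP for as long as the gate is closed. Returns the state
 * which was reached.
 */
static int gated_advance(struct gated_cpu *c, int target)
{
	assert(target <= GATE_ACTIVE);

	while (c->state < target) {
		if (c->state == GATE_STEP && !c->gate_open)
			break;	/* parked until the control side lets go */
		c->state++;
	}
	return c->state;
}

/* Run by the control side once the serialized section is done. */
static void gate_release(struct gated_cpu *c)
{
	c->gate_open = true;
}
```

A CPU starting after the handshake (state 12) and asked to go to 40 stops
at 20, and only proceeds to 40 on the next gated_advance() after
gate_release() has been called.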

Thanks,

tglx