Re: [git pull] cpus4096 fixes

From: Ingo Molnar
Date: Mon Jul 28 2008 - 03:56:43 EST



* Rusty Russell <rusty@xxxxxxxxxxxxxxx> wrote:

> On Monday 28 July 2008 13:06:36 Andrew Morton wrote:
> > On Mon, 28 Jul 2008 10:42:12 +1000 Rusty Russell <rusty@xxxxxxxxxxxxxxx> wrote:
> > > The 4k CPU patches have been sliding in without review up until now.
> >
> > wot?
>
> This surprises you? [...]

you should check many of the earliest iterations (it's all on lkml), and
the bits we rejected in review/testing. You'll be surprised how much
questionable and fragile stuff was filtered out.

But your intuition is right in a sense, this whole topic _feels_ ugly,
and there's a good reason for it and i doubt you'll like it:

Much of it derives from the ugly fact that cpumasks were designed to be
word-size-ish and are used as such in hundreds of places in the kernel,
while with 4K CPUs they become half a _kilobyte_.
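
To put numbers on it: a cpumask is an array of longs with one bit per
possible CPU, so its size follows directly from NR_CPUS. A minimal
userspace sketch of the arithmetic (mask_bytes() and mask4096_t are
stand-in names made up for illustration, not the kernel's definitions):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical stand-in for a cpumask sized for 4096 CPUs. */
typedef struct {
	unsigned long bits[BITS_TO_LONGS(4096)];
} mask4096_t;

/* Bytes needed for a one-bit-per-CPU mask, rounded up to whole longs. */
static size_t mask_bytes(size_t nr_cpus)
{
	assert(nr_cpus > 0);
	return BITS_TO_LONGS(nr_cpus) * sizeof(unsigned long);
}
```

With NR_CPUS=8 the mask fits in a single long; with NR_CPUS=4096 it is
512 bytes, and every by-value copy or on-stack temporary costs that much.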

That causes the basic conceptual friction. That fundamental unease is
what caused me to split these patches off into a completely separate
topic, so that they can be NAK-ed individually without blocking other
subsystem changes. Mike will be able to tell you how many bits were
rejected and rewritten - it's been one of the most iterated topics.

Unless you know some good way around that basic "0.5K cpumask" problem
[besides the 'don't try to do it at all then, stupid' solution], Mike's
painful year-long, multi-release, all-on-lkml effort looks quite close
to what _can_ be done: bootstrapping a 4K CPUs kernel, tracking down
dozens of early boot crashes, looking at stack sizes in zillions of
functions, and writing a ton of patches to evolve the APIs to cope with
it better (all of this was done out in the open on lkml for all to
see).

128/256/512/1024 CPU support (which has been upstream for years and
built into enterprise distros, etc.) already turned cpumasks into rather
static objects in practice and their proliferation into hotpaths stopped
- so maybe we could just turn them into non-stack objects from now on.

( with perhaps some nice wrappers that turn them into on-stack objects
to not slow down the common case. Mike tried to do something like
that. )
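
Roughly, such a wrapper could look like the userspace sketch below. All
names here (mask_var_t, alloc_mask_var(), MASK_OFFSTACK_THRESHOLD) are
hypothetical and this is not Mike's actual patch; it only illustrates
the idea of one declaration that is a plain on-stack object for small
NR_CPUS and a heap pointer for large NR_CPUS:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#ifndef NR_CPUS
#define NR_CPUS 4096
#endif
#define MASK_OFFSTACK_THRESHOLD 128	/* hypothetical cut-off */

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MASK_LONGS    ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct mask { unsigned long bits[MASK_LONGS]; };

#if NR_CPUS > MASK_OFFSTACK_THRESHOLD
/* Large config: the variable is only a pointer, storage is allocated. */
typedef struct mask *mask_var_t;
static bool alloc_mask_var(mask_var_t *m)
{
	*m = calloc(1, sizeof(**m));
	return *m != NULL;
}
static void free_mask_var(mask_var_t m) { free(m); }
#else
/* Small config: a one-element array, so it lives on the stack for free. */
typedef struct mask mask_var_t[1];
static bool alloc_mask_var(mask_var_t *m)
{
	memset(*m, 0, sizeof(*m));
	return true;
}
static void free_mask_var(mask_var_t m) { (void)m; }
#endif

static void mask_set_cpu(unsigned int cpu, struct mask *m)
{
	assert(cpu < NR_CPUS);
	m->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

static bool mask_test_cpu(unsigned int cpu, const struct mask *m)
{
	return (m->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL;
}

/* Same caller code in both configurations: allocate, use, free. */
static bool demo_roundtrip(unsigned int cpu)
{
	mask_var_t tmp;
	bool ok;

	if (cpu >= NR_CPUS || !alloc_mask_var(&tmp))
		return false;
	mask_set_cpu(cpu, tmp);
	ok = mask_test_cpu(cpu, tmp) &&
	     !mask_test_cpu((cpu + 1) % NR_CPUS, tmp);
	free_mask_var(tmp);
	return ok;
}
```

The price is that callers have to handle an allocation failure even in
configurations where none can happen.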

Help and more cleanup patches welcome. Mike & co did most of the hard
work already, latest -git does boot with 4K cpus built into the kernel.
We can iterate this stuff a _lot_ easier now. Turn on CONFIG_MAXSMP=y on
x86 and you can boot it on your PC.

> [...] I stumbled across the cpumask_of_cpu() bug because I happened
> to want it for stop_machine and read the damned code. But it lead me
> to the surrounding code, which is pretty questionable. An
> arch-specific map, rather than depending on NR_CPUS? Adding
> set_cpus_allowed_ptr() instead of changing set_cpus_allowed()? [...]

the set_cpus_allowed_ptr() change, too, was done due to review feedback:
to reduce the friction with other trees and to make for smoother
migration. Breaking an existing API is far too rude a technique for a
long-lived topic like this. (it's been going on for nearly a year)
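
The pattern is simply: add a by-pointer entry point and keep the old
by-value one as a thin wrapper around it, so existing callers keep
compiling and can be converted one by one. A userspace sketch with
made-up names (task_set_allowed*(), mask_t and struct task here are
illustrative stand-ins, not the kernel's code):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS       4096
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MASK_LONGS    ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

typedef struct { unsigned long bits[MASK_LONGS]; } mask_t;

struct task { mask_t allowed; };

/* New entry point: takes a pointer, so no 512-byte argument copy. */
static int task_set_allowed_ptr(struct task *p, const mask_t *new_mask)
{
	unsigned long any = 0;
	size_t i;

	assert(p && new_mask);
	for (i = 0; i < MASK_LONGS; i++)
		any |= new_mask->bits[i];
	if (!any)
		return -1;	/* refuse an empty mask */
	p->allowed = *new_mask;
	return 0;
}

/* Old entry point, kept so that unconverted callers do not break. */
static inline int task_set_allowed(struct task *p, mask_t new_mask)
{
	return task_set_allowed_ptr(p, &new_mask);
}
```

Once all callers use the pointer variant, the by-value wrapper can be
dropped without a flag day.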

> [...] Macros which declare things and may or may not do an
> allocation/free? Finally a patch so horrifically ugly that it can't
> be ignored any more gets all the way to Linus.

[ hey, is that your suggested solution you are talking about? ;-) ]

> Overall, it seems like an attempt to sneak in gradual workarounds for
> cpumasks on the stack, rather than a coherent plan. I understand the
> temptation to avoid an "are we prepared to pay this price for large
> NR_CPUS?" discussion, but we need it anyway.

sure. From a practical standpoint 4096 CPUs support looks pretty stable
and functional. I boot a 4K cpus kernel every couple of minutes:

config-Sun_Jul_27_09_15_47_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_09_27_00_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_09_29_39_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_09_36_41_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_09_40_22_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_09_59_33_CEST_2008.good:CONFIG_MAXSMP=y

config-Sun_Jul_27_22_14_47_CEST_2008.good:CONFIG_NR_CPUS=8
config-Sun_Jul_27_22_20_09_CEST_2008.good:CONFIG_NR_CPUS=8
config-Sun_Jul_27_22_25_32_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_22_25_32_CEST_2008.good:CONFIG_NR_CPUS=4096
config-Sun_Jul_27_22_36_52_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_22_36_52_CEST_2008.good:CONFIG_NR_CPUS=4096
config-Sun_Jul_27_22_42_19_CEST_2008.good:CONFIG_MAXSMP=y
config-Sun_Jul_27_22_42_19_CEST_2008.good:CONFIG_NR_CPUS=4096
config-Sun_Jul_27_22_47_28_CEST_2008.good:CONFIG_NR_CPUS=32
config-Sun_Jul_27_22_52_47_CEST_2008.good:CONFIG_NR_CPUS=32
config-Sun_Jul_27_22_57_59_CEST_2008.good:CONFIG_NR_CPUS=32

The last difficult regression was months ago. So this stuff is
hackable in practice and you can try out the end result if you are
interested in it.

Ingo