Re: [RFC/PATCH] SLQB: Mark the allocator as broken PowerPC and S390

From: Mel Gorman
Date: Thu Sep 17 2009 - 06:57:14 EST


On Thu, Sep 17, 2009 at 01:29:24PM +0300, Pekka Enberg wrote:
> Hi Mel,
>
> On Wed, Sep 16, 2009 at 09:37:39AM +0300, Pekka Enberg wrote:
> > > The SLQB allocator is known to be broken on certain PowerPC and S390
> > > configurations. Disable the allocator in Kconfig for those architectures
> > > until the issues are resolved.
> >
> > Can the issues be summarised?
>
> It's a boot time crash during module load:
>
> http://www.mail-archive.com/linuxppc-dev@xxxxxxxxxxxxxxxx/msg33092.html
>
> AFAICT, it's related to a memoryless node 0. Nick suggested it could be
> a latent bug in the kernel that's triggered by SLQB.
>

The danger is that this isn't a PPC or s390 bug as such, but a bug that
triggers when there are memoryless nodes or when node 0 is memoryless. Hence,
there is no guarantee that your Kconfig option will catch all instances where
this bug triggers. Granted, the configuration is most likely a PPC machine :)

> On Thu, 2009-09-17 at 11:08 +0100, Mel Gorman wrote:
> > The danger is if SLQB is being silently disabled, it'll never be noticed
> > or debugged :/
>
> Maybe, but that's not an excuse to push something that's known to break.
>

Wow, this is from back in May! Lame.

I'm against silently disabling it. Memoryless nodes are extremely rare but
bugs crop up there occasionally and take a long time to catch and squash. SLQB
breaking there is not going to cause widespread damage, but it will force a
fix to be developed by the people with access to the affected machines.

> The other alternative is to skip this release cycle but I'm not sure
> what we'd gain with that. Nick already stated in private that he'll try
> to arrange for some time with ppc machines to debug the thing and we
> hope to be able to fix it by 2.6.32 final.
>

I have access to a ppc machine but not necessarily one with memoryless nodes
that can reproduce this problem.

Assuming Sachin is the reporter and we are in the same company, maybe I
have access to the machine. Sachin, can you mail me privately what this
machine is called and let's see if I can get some time on that machine? By
any chance, was this bisected or did it just show up when SLQB became
the default?

Total aside, does anybody know offhand if fake NUMA support allows the
creation of memoryless nodes to help reproduce problems like this? If I can't
get a real machine, that's the approach I'll be trying.
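
For anyone checking whether a box (or a fake NUMA setup) really has a
memoryless node, here is a minimal sketch; it is not part of the patch under
discussion. It assumes the per-node meminfo files under
/sys/devices/system/node and their "Node N MemTotal: X kB" line. The function
names are made up for illustration.

```python
import glob
import os
import re

# Assumed sysfs location of per-node data on Linux; pass another root to test.
SYSFS_NODE_ROOT = "/sys/devices/system/node"

# Matches e.g. "Node 0 MemTotal:       16308532 kB"
_MEMTOTAL = re.compile(r"^Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB", re.MULTILINE)


def node_memtotal_kb(meminfo_text):
    """Return (node_id, MemTotal in kB) parsed from one node's meminfo text."""
    match = _MEMTOTAL.search(meminfo_text)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1)), int(match.group(2))


def memoryless_nodes(root=SYSFS_NODE_ROOT):
    """Return the sorted ids of nodes whose MemTotal is 0 kB."""
    result = []
    for path in glob.glob(os.path.join(root, "node[0-9]*", "meminfo")):
        with open(path) as handle:
            node_id, total_kb = node_memtotal_kb(handle.read())
        if total_kb == 0:
            result.append(node_id)
    return sorted(result)


if __name__ == "__main__":
    nodes = memoryless_nodes()
    print("memoryless nodes:", nodes if nodes else "none")
```

If node 0 shows up in that list, the machine matches the configuration
suspected in the report; an empty list means the box is not a candidate for
reproducing it.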

> Btw, the code is in slqb/core branch of slab.git in case someone wants
> to take a stab at fixing the bug.
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab