Re: NUMA API for Linux

From: Andi Kleen
Date: Wed Apr 07 2004 - 20:34:02 EST


On Wed, 07 Apr 2004 17:58:23 -0700
Matthew Dobson <colpatch@xxxxxxxxxx> wrote:


> Is there a reason you don't have a case for MPOL_PREFERRED? You have a
> comment about it in the function, but you don't check the nodemask isn't
> empty...

Empty prefered is a special case. It means DEFAULT. This is useful
when you have a process policy != DEFAULT, but want to set a specific
VMA to default. Normally default in a VMA would mean use process policy.


> In this function, why do we care what bits the user set past
> MAX_NUMNODES? Why shouldn't we just silently ignore the bits like we do
> in sys_sched_setaffinity? If a user tries to hand us an 8k bitmask, my
> opinion is we should just grab as much as we care about (MAX_NUMNODES
> bits rounded up to the nearest UL).

This is to catch uninitialized bits. Otherwise it could work on a kernel
with small MAX_NUMNODES, and then suddenly fail on a kernel with bigger
MAX_NUMNODES when a node isn't online.


> This seems a bit strange to me. Instead of just allocating a whole
> struct zonelist, you're allocating part of one? I guess it's safe,
> since the array is meant to be NULL terminated, but we should put a note
> in any code using these zonelists that they *aren't* regular zonelists,
> they will be smaller, and dereferencing arbitrary array elements in the
> struct could be dangerous. I think we'd be better off creating a
> kmem_cache_t for these and using *whole* zonelist structures.
> Allocating part of a well-defined structure makes me a bit nervous...

And that after all the whining about sharing policies? ;-) (a BIND policy will
always carry a zonelist). As far as I can see all existing zonelist code
just walks it until NULL.

I would not be opposed to always using a full one, but it would use considerably
more memory in many cases.


> I'm guessing this is why you aren't checking MPOL_PREFERRED in
> check_policy()? So the user can call mbind() with MPOL_PREFERRED and an
> empty nodes bitmap and get the default behavior you mentioned in the
> comments?

Yep.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/