NUMA API

From: Ulrich Drepper
Date: Fri Apr 30 2004 - 02:37:55 EST


Over the last few weeks I have been working on designing a new API for
a NUMA support library. I am aware of the code in libnuma by ak, but
that code has many shortcomings:

~ inadequate topology discovery
~ fixed cpu set size
~ no clear separation of memory nodes
~ no inclusion of SMT/multicore in the cpu hierarchy
~ awkward (at best) memory allocation interface
~ etc etc

and last but not least

~ a completely unacceptable library interface (e.g., global variables as
part of the API, WTF?)

At the end of the attached document is a comparison of the two APIs.


I'm only posting about this now because I wanted to get some sanity
checks on the API first. Some of our (i.e., Red Hat's) partners
provided these; they may identify themselves, or not. This is not
because other parties are meant to be excluded.


The API described here is meant to be a minimal one which can be
wrapped for use in any kind of higher-level language (or even in
another C library using the interface). For this reason the CPU and
memory node sets are not handled through an abstract data type but as
bitmaps. Using an abstract data type (in C) would restrict the way
wrapper libraries can be designed. In a C++ wrapper, for instance, the
bit sets certainly should be abstract. A later version of the attached
document might try to provide higher-level interfaces.


The text of the API proposal is not yet polished. In fact, most
descriptions are fairly short. I want to get some more assurance that
the API is well received before spending significantly more time on it.

As specified, the implementation of the interface is designed with only
the requirements of a program on NUMA hardware in mind. I have paid no
attention to the currently proposed kernel extensions. If the latter do
not really allow implementing the functionality programmers need, then
they are wasted effort.

For instance, I think the way memory is allocated in interleaved
fashion is not "ideal". Interleaved allocation is a property of a
specific allocation. Global state for a process (or thread) is a
terrible way to handle this and other properties, since it requires the
programmer to constantly switch the mode back and forth: any part of
the runtime might be NUMA-aware and reset the mode.

Also, the concept of hard/soft sets for CPUs is useful, likewise
"spilling" over to other memory nodes. Using NUMA usually means
hinting the desired configuration to the system; it will be used
whenever possible. If it is not possible (for instance, if a given
processor is not available) it is usually not a good idea to fail the
execution completely. Instead a less optimal resource should be used.
For memory it is hard to know how much memory on which node is in use,
etc.

Another missing feature in libnuma and the current kernel design is
support for changes in the configuration. CPUs might be added or
removed, likewise memory. Additional interconnects between NUMA blocks
might be added, etc.


Overall I think the proposed API provides an architecture-independent,
future-safe NUMA API. If no program uses the kernel functionality
directly (which is possible with the API), the kernel interface can be
changed and adapted for each architecture or even a specific machine
without the program noticing it.


The selection of names for the functions is by no means fixed; these
are proposals. I'm open to constructive criticism. In case you find
interfaces to be missing, wrong, or not optimal, please let me know as
well. Once the API is regarded as useful we can start thinking about
the kernel interface, so keep these two things separate.


Please direct comments to me. In case there is interest I can set up a
separate mailing list since lkml is probably not the best venue.

--
Ulrich Drepper, Red Hat, Inc., 444 Castro St, Mountain View, CA
Thoughts about a NUMA API

Ulrich Drepper
Red Hat, Inc.
Time-stamp: <2004-04-29 01:18:32 drepper>

*** Very early draft. I'll clean up the interface when I get some positive
*** feedback.


The technology used in NUMA machines is still evolving, which means
that any proposed interface will fall short over time. We cannot
think about every possibility and nuance the hardware designers come
up with. The following is a list of assumptions made for this
document. Some assumptions will be too general for some
implementations, which allows simplification. But the interfaces
should cover the broader range of designs.

1. Non-uniform resources are processors and memory

2. The address spaces of processors overlap

3. Possible measure: distance of processors

The distance is measured by the minimal difference in the cost of
accessing memory.

~ SMT and multi-core (MC) processors share some processor cache;

~ processors on the same SMP node (which might be just one
processor in size) have the same distance, which is larger
than the SMT/MC distance;

~ the distance between processors on different NUMA nodes increases
with each interconnect which has to be used.

4. The machine's architecture can change over time.

~ hotplug CPUs/RAM

~ dynamically enabling/disabling parts of the machine based on
resource requirements


Example
=======

+----------------------------------------+
| +--------------+ +--------------+ |
| | +--+ +--+ | | +--+ +--+ | |
| | |T1| |T1| | | |T1| |T1| | |
| | +--+ +--+ | | +--+ +--+ | |
| | |T2| |T2| | | |T2| |T2| | |
| | +--+ +--+ | | +--+ +--+ | |
| | C1 C2 | | C1 C2 | |
| +------P1------+ +------P2------+ |
| |
| +------------------------------------+ |
| | M1 | |
| +------------------------------------+ |
+-------------------N1-------------------+

+----------------------------------------+
| +--------------+ +--------------+ |
| | +--+ +--+ | | +--+ +--+ | |
| | |T1| |T1| | | |T1| |T1| | |
| | +--+ +--+ | | +--+ +--+ | |
| | |T2| |T2| | | |T2| |T2| | |
| | +--+ +--+ | | +--+ +--+ | |
| | C1 C2 | | C1 C2 | |
| +------P1------+ +------P2------+ |
| |
| +------------------------------------+ |
| | M2 | |
| +------------------------------------+ |
+-------------------N2-------------------+

+----------------------------------------+
| +--------------+ +--------------+ |
| | +--+ +--+ | | +--+ +--+ | |
| | |T1| |T1| | | |T1| |T1| | |
| | +--+ +--+ | | +--+ +--+ | |
| | |T2| |T2| | | |T2| |T2| | |
| | +--+ +--+ | | +--+ +--+ | |
| | C1 C2 | | C1 C2 | |
| +------P1------+ +------P2------+ |
| |
| +------------------------------------+ |
| | M3 | |
| +------------------------------------+ |
+-------------------N3-------------------+

These are three NUMA blocks, each consisting of two SMP processors,
each of which has two cores which in turn have two threads each. We
use the notation T1:C2:P2:N3 for the first thread, in the second core,
in the second processor, on the third node. The main memory in the
nodes is represented by M1, M2, M3.

A simplistic measure in this case could be: each additional level of
memory which has to be accessed doubles the cost. So we might have the
following costs:

T1:C1:P1:N1 <-> T2:C1:P1:N1 == 1
T1:C1:P1:N1 <-> T1:C2:P1:N1 == 2
T1:C1:P1:N1 <-> T1:C1:P2:N1 == 4
T1:C1:P1:N1 <-> T1:C1:P1:N2 == 8
T1:C1:P1:N1 <-> T1:C1:P1:N3 == 16 (i.e., 2 * 8 since two interconnects
are used)

It might be better to compute the distance based on real memory access
costs. The above is just an example.

The above costs automatically take into account whether the main
memory of a node has to be used or whether data is shared in caches.

A second cost measure does not take the sharing of data between
processors into account but instead measures access to data stored in
a specific memory node.

T1:C1:P1:N1 -> M1 == 4
T1:C1:P1:N1 -> M2 == 8
T1:C1:P1:N1 -> M3 == 16

This cost can be derived from the more detailed CPU-to-CPU cost, but
since there can be memory nodes without CPUs, and since it is often not
the sharing of data between CPUs but access to stored memory which
matters, this simplified cost is useful, too.


Interfaces
==========

The interfaces can be grouped:

1. Topology. Programs need to know about the machine's layout.

2. Placement/affinity

~ of execution
~ of memory allocation

3. Realignment: adjust placement/affinity to new situation

4. Temporal changes



Topology Interfaces
-------------------

Two different types of information must be accessible:

1. enumeration of the memory hierarchies

This includes SMT/MC

2. distance


The fundamental data type is a bitset with each bit representing a
processor. glibc defines cpu_set_t. The set can in principle be
arbitrarily large; we might introduce interfaces to dynamically
allocate such sets. For now, cpu_set_t is a fixed-size type.

CPU_SETSIZE number of processors in cpu_set_t

CPU_SET_S(cpu, setsize, cpuset) set bit corresponding to CPU in CPUSET
CPU_CLR_S(cpu, setsize, cpuset) clear bit corresponding to CPU in CPUSET
CPU_ISSET_S(cpu, setsize, cpuset) check whether bit corresponding to CPU is set
CPU_ZERO_S(setsize, cpuset) clear set

CPU_EQUAL_S(setsize1, cpuset1, setsize2, cpuset2)
Check whether the set bits in the two
sets match.
CPU_EQUAL(cpuset1, cpuset2) CPU_EQUAL_S(sizeof(cpu_set_t), cpuset1,
sizeof(cpu_set_t), cpuset2)


CPU_SET(cpu, cpuset) CPU_SET_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_CLR(cpu, cpuset) CPU_CLR_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_ISSET(cpu, cpuset) CPU_ISSET_S(cpu, sizeof(cpu_set_t), cpuset)
CPU_ZERO(cpuset) CPU_ZERO_S(sizeof(cpu_set_t), cpuset)
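
As an illustration, a minimal sketch using the fixed-size macros.
cpu_set_t and the non-_S macros already exist in glibc (<sched.h> with
_GNU_SOURCE defined) and take the address of the set; CPU_EQUAL() is
the proposed addition:

cpu_set_t a, b;

CPU_ZERO(&a);                 /* start with an empty set */
CPU_SET(0, &a);               /* add CPU 0 */
CPU_SET(2, &a);               /* add CPU 2 */

CPU_ZERO(&b);
CPU_SET(2, &b);

int both = CPU_ISSET(2, &a) && CPU_ISSET(2, &b);   /* CPU 2 in both sets */
int same = CPU_EQUAL(&a, &b);                      /* proposed macro, 0 here */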


We probably need the following:

CPU_AND_S(destsize, destset, set1size, srcset1, set2size, srcset2)
logical AND of SRCSET1 and SRCSET2, place result in DESTSET, which
may be one of the source sets

CPU_OR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
logical OR of SRCSET1 and SRCSET2, place result in DESTSET, which
may be one of the source sets

CPU_XOR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
logical XOR of SRCSET1 and SRCSET2, place result in DESTSET, which
may be one of the source sets

For dynamic allocation:

__cpu_mask type for array element in cpu_set_t

CPU_ALLOC_SIZE(count) number of bytes needed for a cpu_set_t
which can represent at least CPU number COUNT

CPU_ALLOC(count) allocate a cpu_set_t which can represent
at least CPU number COUNT

CPU_FREE(cpuset) free CPU set previously allocated with CPU_ALLOC()
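
A sketch of how the dynamic allocation interfaces might fit together,
assuming CPU_ALLOC() returns a pointer to the newly allocated set (the
draft does not spell out the return type):

/* Room for up to 2048 CPUs; CPU_SETSIZE would be too small.  */
size_t setsize = CPU_ALLOC_SIZE(2048);
cpu_set_t *set = CPU_ALLOC(2048);

CPU_ZERO_S(setsize, set);           /* clear the whole dynamic set */
CPU_SET_S(0, setsize, set);         /* add CPU 0 */
CPU_SET_S(1536, setsize, set);      /* a CPU beyond CPU_SETSIZE */

/* ... pass SETSIZE and SET to the NUMA_* interfaces below ...  */

CPU_FREE(set);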


Maybe interfaces to iterate over a set are useful (C++ interface).


A similar type is defined for the representation of memory nodes.
Each processor is associated with one memory node and each memory node
can have zero or more processors associated with it.

memnode_set_t basic type

MEMNODE_SET_S(node, memnodesize, memnodeset)
set bit corresponding to NODE in MEMNODESET
MEMNODE_CLR_S(node, memnodesize, memnodeset)
clear bit corresponding to NODE in MEMNODESET
MEMNODE_ISSET_S(node, memnodesize, memnodeset)
check whether bit corresponding to NODE is set
MEMNODE_ZERO_S(memnodesize, memnodeset) clear set

MEMNODE_EQUAL_S(setsize1, memnodeset1, setsize2, memnodeset2)
Check whether the set bits in the two
sets match.

We probably need the following:

MEMNODE_AND_S(destsize, destset, set1size, srcset1, set2size, srcset2)
logical AND of SRCSET1 and SRCSET2, place result in DESTSET, which
may be one of the source sets

MEMNODE_OR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
logical OR of SRCSET1 and SRCSET2, place result in DESTSET, which
may be one of the source sets

MEMNODE_XOR_S(destsize, destset, set1size, srcset1, set2size, srcset2)
logical XOR of SRCSET1 and SRCSET2, place result in DESTSET, which
may be one of the source sets

For dynamic allocation:

__memnode_mask type for array element in memnode_set_t

MEMNODE_ALLOC_SIZE(count) see CPU_ALLOC_SIZE()

MEMNODE_ALLOC(count) see CPU_ALLOC()

MEMNODE_FREE(memnodeset) see CPU_FREE()


To determine the topology:


int NUMA_cpu_count(unsigned int *countp)

Return in *COUNTP the number of online CPUs. The
sysconf(_SC_NPROCESSORS_ONLN) information might be sufficient, though.
Returning an error can signal that NUMA support is not present.


int NUMA_cpu_all(size_t destsize, cpu_set_t dest)

Set bits for all (currently) available CPUs.


int NUMA_cpu_self(size_t destsize, cpu_set_t dest)

Set bit for current processor

int NUMA_cpu_self_idx(void)

Return index in cpu_set_t for current processor.
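
A short sketch of how a program might start out with these calls,
following the prototypes as declared above (the header providing the
declarations is not specified by this draft; an error return from
NUMA_cpu_count() is taken as "no NUMA support"):

unsigned int ncpus;
if (NUMA_cpu_count(&ncpus) != 0)
  ncpus = 1;                        /* error: no NUMA support present */

cpu_set_t all, self;
NUMA_cpu_all(sizeof(all), all);     /* all currently online CPUs */
NUMA_cpu_self(sizeof(self), self);  /* the CPU the caller runs on right now */

int myidx = NUMA_cpu_self_idx();    /* index of that CPU in cpu_set_t */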


int NUMA_cpu_at_level(size_t destsize, cpu_set_t dest, size_t srcsize,
const cpu_set_t src, int level)

Fill DEST with the bitmap which has all the bits set corresponding to
processors which are (currently) LEVEL or fewer levels away from any
processor in SRC.

In the simplest case one bit is set in SRC. Level 1 might be used
to find all SMT siblings. If more than one bit is set, the
search is started from all of them.

NB: this is "level", not "distance". Since the distance could be
relative to the access cost, there need not be sequential values which
can be used in iteration. With this interface we can go on incrementing
LEVEL until no further processor is found, which could be signalled
by a return value.


int NUMA_cpu_distance(int *minp, int *maxp, size_t setsize,
const cpu_set_t set)

Determine the minimum and maximum distance between the processors in SET.

This distance is a measure of the cost of sharing memory.

Usually two bits are set. If more bits are set, the spread between min
and max is useful.
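
For instance, the spread between two specific processors could be
determined like this (a sketch against the prototype above; cpu_a and
cpu_b are assumed CPU indexes):

cpu_set_t pair;
CPU_ZERO(&pair);
CPU_SET(cpu_a, &pair);
CPU_SET(cpu_b, &pair);

int min, max;
NUMA_cpu_distance(&min, &max, sizeof(pair), pair);
/* With exactly two bits set MIN == MAX: the cost of sharing memory
   between the two processors.  */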


int NUMA_mem_main_level(int cpuidx, int *levelp)

Return in *LEVELP the level at which the local main memory for
processor CPUIDX resides.


int NUMA_memnode_count(unsigned int *countp)

Return in *COUNTP the number of online memnodes.


int NUMA_memnode_all(size_t destsize, memnode_set_t dest)

Set bits for all (currently) available memnodes.


int NUMA_cpu_to_memnode(size_t cpusetsize, const cpu_set_t cpuset,
size_t memnodesize, memnode_set_t memnodeset)

Set bits in MEMNODESET which correspond to memory nodes which are local
to any of the CPUs represented by bits set in CPUSET.


int NUMA_memnode_to_cpu(size_t memnodesize, const memnode_set_t memnodeset,
size_t cpusetsize, cpu_set_t cpuset)

Set bits in CPUSET which correspond to CPUs which are local
to any of the memory nodes represented by bits set in MEMNODESET.
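
Combining these, a thread can find the memory node(s) local to the
processor it currently runs on (a sketch; sets are passed as in the
prototypes above):

cpu_set_t self;
memnode_set_t local;

NUMA_cpu_self(sizeof(self), self);          /* the CPU we run on */
NUMA_cpu_to_memnode(sizeof(self), self,
                    sizeof(local), local);  /* its local memory node(s) */

/* LOCAL now has a bit set for every memory node attached to that CPU;
   usually this is a single node.  */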


int NUMA_mem_distance(int *minp, int *maxp, void *ptr, size_t setsize,
const memnode_set_t set)

Determine the minimum and maximum level difference to the memory pointed
to by PTR from any of the CPUs in SET.

Usually one bit is set in SET.

If the difference between *MINP and the value returned from
NUMA_mem_main_level() is zero, the memory is local to at least one CPU
in SET. If the difference between *MAXP and the NUMA_mem_main_level()
value is zero, the memory is local to all CPUs in SET.


int NUMA_cpu_mem_cost(int *minp, int *maxp, size_t cpusetsize,
const cpu_set_t cpuset, size_t memsetsize,
const memnode_set_t memnodeset)

Compute minimum and maximum access costs of processors in CPUSET to
any of the memory nodes in MEMNODESET.


Example: Determine CPUs on neighbor "nodes"

cpu_set_t level0;
CPU_ZERO(&level0);
CPU_SET(the_cpu, &level0);

cpu_set_t levelN;
NUMA_cpu_at_level(sizeof(levelN), levelN, sizeof(level0), level0, N);

cpu_set_t levelNp1;
NUMA_cpu_at_level(sizeof(levelNp1), levelNp1, sizeof(level0), level0, N + 1);

CPU_XOR(levelNp1, levelNp1, levelN);

Given a CPU index (the_cpu), the CPUs at level N are determined, then
those at level N+1. The difference (XOR) is the set of processors
which are exactly level N+1 away from the given CPU.



Memory Information
------------------


It is necessary to know something about the memory at a given level.
For instance, level 1 might be "level 1 CPU cache", level 4 might be
"main memory".

int NUMA_mem_info_size(int level, int cpuidx, NUMA_size_t *size)
int NUMA_mem_info_associativity(int level, int cpuidx, NUMA_size_t *size)
int NUMA_mem_info_linesize(int level, int cpuidx, NUMA_size_t *size)

*_size applies to all kinds of memory. _associativity and _linesize
mainly apply to caches. Maybe they are useful for main memory, too; if
not, an error could be returned.


int NUMA_mem_total(int memnodeidx, NUMA_size_t *size)
int NUMA_mem_avail(int memnodeidx, NUMA_size_t *size)

The total memory and available memory on memory node MEMNODEIDX.
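
A sketch of querying this information, using cache level 1 and memory
node 0 purely as examples:

NUMA_size_t l1size, linesize, total, avail;
int cpuidx = NUMA_cpu_self_idx();

NUMA_mem_info_size(1, cpuidx, &l1size);        /* size of level-1 memory */
NUMA_mem_info_linesize(1, cpuidx, &linesize);  /* its line size */

NUMA_mem_total(0, &total);                     /* memory node 0: total... */
NUMA_mem_avail(0, &avail);                     /* ...and currently available */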


Placement/Affinity Interfaces
-----------------------------

int NUMA_mem_set_home(pid_t pid, size_t setsize, memnode_set_t set)

Install SET as the mask of preferred nodes for memory allocation for
process PID. This applies only to directly attached memory
(NUMA_mem_main_level()). If more than one bit is set in SET, the
memory allocation can be spread across all the local memory for the
CPUs in the set.

int NUMA_mem_get_home(pid_t pid, size_t setsize, memnode_set_t set)

Return the currently installed preferred node set.


int NUMA_mem_set_home_thread(pthread_t th, size_t setsize, memnode_set_t set)

Similar, but limited to the given thread.

int NUMA_mem_get_home_thread(pthread_t th, size_t setsize, memnode_set_t set)

Likewise to retrieve the information.
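
A sketch of installing a preferred-node mask for the current process
(sets are passed as in the prototypes; getpid() names the calling
process):

memnode_set_t home;
MEMNODE_ZERO_S(sizeof(home), home);       /* empty set */
MEMNODE_SET_S(0, sizeof(home), home);     /* prefer memory node 0 ... */
MEMNODE_SET_S(1, sizeof(home), home);     /* ... and node 1 */

NUMA_mem_set_home(getpid(), sizeof(home), home);

/* Read back what is currently installed.  */
memnode_set_t cur;
NUMA_mem_get_home(getpid(), sizeof(cur), cur);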


int NUMA_aff_set_cpu(pid_t pid, size_t setsize, cpu_set_t set, int hard)

Set affinity mask for process PID to the processors in SET. There are
two masks: the hard and the soft. No processor not in the hard mask
can ever be used. The soft mask is a recommendation.

int NUMA_aff_get_cpu(pid_t pid, size_t setsize, cpu_set_t set, int hard)

The corresponding interface to get the data.

int NUMA_aff_set_cpu_thread(pthread_t th, size_t setsize, cpu_set_t set,
int hard)

Similar to NUMA_aff_set_cpu() but for the given thread.

int NUMA_aff_get_cpu_thread(pthread_t th, size_t setsize, cpu_set_t set,
int hard)

Get the data.


The "hard" variants are basically the existing sched_setaffinity and
pthread_setaffinity. The soft and hard maps are maintained separately.
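
A sketch of how the two masks might be combined: the hard mask pins the
process to the CPUs of one node, the soft mask recommends one preferred
processor within it (cpus_of_node and preferred_cpu are assumed to have
been determined earlier, e.g. with NUMA_memnode_to_cpu()):

/* Hard limit: only the CPUs of the chosen node may ever be used.  */
NUMA_aff_set_cpu(getpid(), sizeof(cpus_of_node), cpus_of_node, 1);

/* Soft preference: within that node, prefer one particular CPU.  */
cpu_set_t soft;
CPU_ZERO(&soft);
CPU_SET(preferred_cpu, &soft);
NUMA_aff_set_cpu(getpid(), sizeof(soft), soft, 0);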


void *NUMA_mem_alloc_local(NUMA_size_t size, int spill, int interleave)

Allocate SIZE bytes local to the current process, regardless of the
registered preferred memory node mask. Unless SPILL is nonzero the
allocation fails if no memory is available locally. If SPILL is nonzero,
memory at greater distances is considered. This is a convenience
interface; it could be implemented using NUMA_mem_alloc() below.

Possible extension: SPILL could specify how far away the memory can
be spilled. For instance, the value 1 could mean one NUMA node away,
2 up to two NUMA nodes away, etc.

If INTERLEAVE is nonzero the memory is allocated in interleaved form
from all the nodes specified. Otherwise all memory comes from one node.
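
For example (a sketch; NUMA_mem_free() is the matching deallocation
interface described below):

/* 1 MiB local to the current process; allow spilling to more distant
   nodes if local memory is exhausted; no interleaving.  */
void *buf = NUMA_mem_alloc_local(1024 * 1024, 1, 0);
if (buf != NULL)
  {
    /* ... use BUF ...  */
    NUMA_mem_free(buf);
  }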


void *NUMA_mem_alloc_preferred(NUMA_size_t size, int spill, int interleave)

Allocate memory according to the mask registered with NUMA_mem_set_home()
or NUMA_mem_set_home_thread(). SPILL and INTERLEAVE are handled as in
NUMA_mem_alloc_local().


void *NUMA_mem_alloc(NUMA_size_t size, size_t setsize, memnode_set_t set,
int spill, int interleave)

Allocate memory on any of the nodes in SET.


What chunks of memory can be allocated is debatable. It might make sense
to restrict all sizes to page size granularity, or at least to round all
values up.

??? Should the granularity be configurable ???


void NUMA_mem_free(void *)

Obviously, free the memory.
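
A sketch tying the allocation pieces together: allocate memory
interleaved across the memory nodes local to the CPUs the process may
use (sets are passed as in the prototypes; the 16 MiB size is just an
example):

cpu_set_t cpus;
memnode_set_t nodes;

/* The CPUs we are allowed to run on (hard mask) ...  */
NUMA_aff_get_cpu(getpid(), sizeof(cpus), cpus, 1);
/* ... and the memory nodes local to them.  */
NUMA_cpu_to_memnode(sizeof(cpus), cpus, sizeof(nodes), nodes);

/* 16 MiB spread (interleaved) across those nodes; no spilling.  */
void *buf = NUMA_mem_alloc(16 * 1024 * 1024, sizeof(nodes), nodes, 0, 1);

/* ... use BUF ...  */

NUMA_mem_free(buf);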


int NUMA_mem_get_nodes(void *addr, size_t destsize, memnode_set_t dest)

The function will set the bits in DEST which represent the memory nodes
which are local to the memory pointed to by ADDR.


int NUMA_mem_bind(void *addr, size_t size, size_t setsize, memnode_set_t set,
int spill)

The memory in the range of [addr,addr+size) in the current process is bound
to one of the nodes represented in SET. Unless SPILL is nonzero the
call will fail if no memory is available on the nodes.
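
A sketch of binding an existing mapping, e.g. one obtained with mmap()
from <sys/mman.h>, to a single memory node (node 1 is just an example):

size_t len = 4 * 1024 * 1024;
void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

memnode_set_t node;
MEMNODE_ZERO_S(sizeof(node), node);
MEMNODE_SET_S(1, sizeof(node), node);

/* Fail rather than spill if node 1 has no memory left.  */
NUMA_mem_bind(addr, len, sizeof(node), node, 0);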


int NUMA_mem_get_nodes(void *addr, size_t size,
size_t destsize, memnode_set_t dest)

The function returns information about the nodes on which the memory
in the range [addr,addr+size) is allocated. If the memory is not
contiguously allocated (or in the case of multi-threaded or multi-core
processors) more than one bit can be set in the result set.


Realignment
-----------

CPU sets can be realigned at any time using NUMA_aff_set_cpu() etc.


void *NUMA_mem_relocate(void *ptr, size_t setsize, memnode_set_t set)

Relocate the content of the memory pointed to by PTR to a node in SET.
Return the new address.
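
For instance, after a thread has been moved to a different node, its
working buffer could follow it (a sketch; BUF is assumed to have been
allocated with one of the interfaces above):

/* Find the memory node(s) local to wherever we run now ...  */
cpu_set_t self;
memnode_set_t here;
NUMA_cpu_self(sizeof(self), self);
NUMA_cpu_to_memnode(sizeof(self), self, sizeof(here), here);

/* ... and move the buffer there. The old pointer must not be used
   afterwards since the address may change.  */
buf = NUMA_mem_relocate(buf, sizeof(here), here);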


Temporal Changes
----------------

The machine configuration can change over time. New processors can
come online, others go offline, memory banks are switched on or off.
The above interfaces return information about the currently active
configuration. There is the danger that data sets from different
configurations are used.

One solution would be to require an open()-like function which
retrieves all the information in one step; all the interfaces
mentioned above would then use that cached data. The problem with this
is that if the configuration changes, decisions made using the cached
data are outdated. Second, the amount of data which is needed can be
big or, more likely, expensive to get, even though only parts of the
information are used.

A different possibility would be to provide a simple callback which
returns a unique ID for each configuration. Any use of the topology
interfaces would then start and end with a call to this function to get
the ID. If the two values differ, the collected data is inconsistent.
This would eliminate the second problem mentioned above, but not the
first.
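
In use, the second possibility could look like this; NUMA_config_id()
is purely hypothetical, the draft does not name such an interface:

unsigned long id1, id2;
do
  {
    id1 = NUMA_config_id();        /* hypothetical: current config ID */

    /* ... read the topology with the interfaces above ...  */

    id2 = NUMA_config_id();
  }
while (id1 != id2);                /* retry if the config changed meanwhile */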

A third possibility is to register a signal handler with the kernel so
that the kernel can send a signal whenever the configuration changes.
Alternatively, a /proc file or a netlink socket could be used to notify
interested parties (who could then send a signal if necessary). Using
d-bus is possible, too. This notification could not only be used to
notice changes while reading the topology, it could also prompt a
process at any time to reconsider its current decisions and reorganize
the processor/memory usage.

Of these possibilities the d-bus route seems the most appealing since
d-bus already receives this kind of information from the kernel and any
number of processes can receive it.


Comparison with libnuma
=======================

nodemask_t:

Unlike nodemask_t, cpu_set_t is already in use in glibc. The affinity
interfaces use it, so there is no need to duplicate the functionality
and no need to define other versions of the affinity interfaces.

Furthermore, the nodemask_t type is of fixed size. cpu_set_t has a
fixed-size convenience version but can also be of arbitrary size.
This is important, as a bit of math shows:

Assume a processor with four cores and 4 threads each.

Four such processors on a single NUMA node.

That's a total of 64 virtual processors for one node. With just 16 such
nodes the 1024 processors of cpu_set_t would be filled. And we do not
even want to mention the mere 64 bits of libnuma's nodemask_t. To be
future-safe the bitset size must be variable.


In addition there is the type memnode_set_t which represents memory
nodes. It is possible to have memory nodes without processors, so a
cpu_set_t alone is not sufficient.


nodemask_zero() --> CPU_ZERO() which is already in glibc
nodemask_set() --> CPU_SET() ditto
nodemask_clr() --> CPU_CLR() ditto
nodemask_isset() --> CPU_ISSET() ditto

nodemask_equal() --> CPU_EQUAL()

Plus the appropriate macros to handle memnode_set_t.


numa_available() --> NUMA_cpu_count() for instance

numa_max_node() --> either NUMA_cpu_count()
or NUMA_cpu_all()

numa_homenode() --> NUMA_mem_get_home() or NUMA_aff_get_cpu()
or NUMA_aff_get_cpu_thread() or NUMA_cpu_self()

The concept of a never-changing home node strikes me as odd, especially
with hot-swap CPUs. Declaring one or more CPUs the home nodes is fine.
The default can be the CPU the thread started on.


numa_node_size() --> NUMA_mem_avail()

The main memory is at level NUMA_mem_main_level()

numa_pagesize() --> nothing yet since useless

It is not clear to me what this really should do. I.e., the interface
of numa_pagesize() seems useless. With no argument, the only page size
which can be determined is the page size of the system. When huge pages
etc. come into play it is necessary to provide a pointer to a memory
address so it can be determined which kind of memory it is.

??? Should we add NUMA_size_t NUMA_pagesize(void *addr) ???


numa_all_nodes --> global variables are *EVIL*

Use NUMA_cpu_all()

numa_no_nodes --> global variables are *EVIL*

cpu_set_t s;
CPU_ZERO(&s);

numa_bind() --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()
or NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()

numa_bind() misses A LOT of flexibility. First, memory and CPU need
not be the same nodes. Second, thread handling is missing. Third,
hard versus soft requirements are not handled for CPU usage.


numa_set_interleave_mask() --> see comment
numa_get_interleave_mask() --> see comment
numa_get_interleave_node() --> see comment
numa_alloc_interleaved_subset() --> see comment
numa_alloc_interleaved() --> see comment
numa_interleave_memory() --> see comment

I do not think that interleaving should be a completely separate mechanism
next to normal memory allocation. Instead it is a logical extension of
memory allocation. Interleaving is a parameter for the memory allocation
functions like NUMA_mem_alloc().


numa_set_homenode() --> NUMA_mem_set_home() or NUMA_aff_set_cpu()
or NUMA_aff_set_cpu_thread() or NUMA_cpu_self()

numa_set_localalloc() --> NUMA_mem_set_home() or NUMA_mem_set_home_thread()


numa_set_membind() --> NUMA_mem_bind()

numa_get_membind() --> NUMA_mem_get_nodes()


numa_alloc_onnode() --> NUMA_mem_alloc()

numa_alloc_local() --> NUMA_mem_alloc_local()

numa_alloc() --> NUMA_mem_alloc_preferred()

numa_free() --> NUMA_mem_free()

numa_tonode_memory() --> NUMA_mem_relocate()

numa_setlocal_memory() --> NUMA_mem_relocate()

numa_police_memory() --> nothing yet

I don't see why this is necessary. Yes, address space allocation and
the actual allocation of memory are two steps. But this should be
taken care of by the allocation functions (if necessary). To support
memory allocation with other interfaces than those described here and
magically treat them in the "NUMA way" seems dumb.


numa_run_on_node_mask() --> NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()

numa_run_on_node() --> NUMA_aff_set_cpu() or NUMA_aff_set_cpu_thread()


numa_set_bind_policy() --> too coarse grained

This cannot be a process property. And it must be possible to change
it from another thread, so the interface is completely broken. Besides,
it seems much more useful to differentiate between hard and soft masks
since this allows spilling over to other nodes if necessary.
NUMA_aff_set_cpu() and NUMA_aff_set_cpu_thread() allow specifying
two masks.