Re: Purpose of numa_node?

From: Yinghai Lu
Date: Thu Jan 31 2008 - 16:30:28 EST


On Jan 31, 2008 5:42 AM, Brice Goglin <Brice.Goglin@xxxxxxxx> wrote:
> Paul Mundt wrote:
> > On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> >
> >> While pondering ways to optimize I/O and swapping on large NUMA machines, I
> >> noticed that the numa_node field in struct device isn't actually used
> >> anywhere. We just have a couple dozen lines of code to conditionally
> >> create a sysfs file that will always return -1. Is anyone even working on
> >> code to actually use this field? I think it's a good piece of information
> >> to keep track of, so I'm not suggesting we remove it, but I want to make
> >> sure I'm not stepping on toes or duplicating effort if I try to make it
> >> useful.
> >>
> > It's manipulated with accessors. If you look at the users of
> > dev_to_node()/set_dev_node() you can see where it's being used. It's
> > primarily used in allocation paths for node locality, and the existing
> > set_dev_node() callsites are places where node locality information
> > already exists (ie, which node a given controller sits on). You can see
> > this in places like PCI (pcibus_to_node()) and USB, with node allocation
> > hints used in places like the dmapool and skb alloc paths.
> >
> > The in-kernel use looks perfectly sane in that regard, though I'm not
> > sure what the point of exporting this as a RO attribute to userspace is.
> > Presumably someone has a tool somewhere that cares about this.
> >
>
> I added the numa_node sysfs attribute in the beginning to make it easier
> to bind processes near some devices. So yes I have some user-space tool
> using it. It is much easier to use than the local_cpus field on large
> machines, especially when you use the libnuma interface to bind things,
> since you don't have to translate numa_node from/to cpumasks.
>
> It works fine on regular machines such as dual opterons. However, I
> noticed recently that it was wrong on some quad-opteron machines (see
> http://marc.info/?l=linux-pci&m=119072400008538&w=2) because something
> is not initialized in the right order. But I haven't tested 2.6.24 on
> this hardware yet, and I don't know if things have changed regarding this.

That depends on whether your DSDT has _PXM for your PCI root bus.
Otherwise you will get -1 everywhere.
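
A quick way to check what a given device reports is to read its numa_node
attribute from user space. The following is only a minimal sketch: the helper
names parse_numa_node() and read_numa_node() are made up for illustration, and
the caller supplies the sysfs path. It treats a missing file, an unparsable
value, or a negative value as "node unknown" (-1).

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Value used here for "no node information"; matches the -1 sysfs reports. */
#define NUMA_NODE_UNKNOWN (-1)

/*
 * Parse the text of a numa_node attribute (for example "0\n" or "-1\n").
 * Hypothetical helper: returns the node number, or NUMA_NODE_UNKNOWN if the
 * text is missing, not a number, negative, or out of range.
 */
int parse_numa_node(const char *text)
{
    char *end;
    long node;

    if (text == NULL)
        return NUMA_NODE_UNKNOWN;

    errno = 0;
    node = strtol(text, &end, 10);
    if (errno != 0 || end == text || node < 0 || node > INT_MAX)
        return NUMA_NODE_UNKNOWN;

    return (int)node;
}

/*
 * Read a numa_node attribute from the given path.
 * Hypothetical helper: returns NUMA_NODE_UNKNOWN if the file cannot be
 * opened or read.
 */
int read_numa_node(const char *path)
{
    char buf[32];
    int node = NUMA_NODE_UNKNOWN;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return NUMA_NODE_UNKNOWN;

    if (fgets(buf, sizeof(buf), f) != NULL)
        node = parse_numa_node(buf);

    fclose(f);
    return node;
}
```

For a PCI device the path would look something like
/sys/bus/pci/devices/0000:00:00.0/numa_node (the device address here is just
an example). If every device under a root bus returns -1, the root bus most
likely has no _PXM.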

I have a patchset locally, called bus_numa, that can get the node from PCI
conf space on AMD64-based machines.
So you can use it on AMD64 systems without _PXM for the PCI root bus, or
even with acpi=off.

Let me know if you want to test it.

YH