Re: [PATCH 13/35] autonuma: add page structure fields

From: Andrea Arcangeli
Date: Tue Jun 19 2012 - 14:07:43 EST


Hi everyone,

On Tue, Jun 05, 2012 at 04:51:23PM +0200, Andrea Arcangeli wrote:
> The details of the solution:
>
> struct page_autonuma {
> 	short autonuma_last_nid;
> 	short autonuma_migrate_nid;
> 	unsigned int pfn_offset_next;
> 	unsigned int pfn_offset_prev;
> } __attribute__((packed));
>
> page_autonuma can only point to a page that belongs to the same node
> (page_autonuma is queued into the
> NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid]) where
> src_nid is the source node that page_autonuma belongs to, so all pages
> in the autonuma_migrate_head[src_nid] lru must come from the same
> src_nid. So the next page_autonuma in the list will be
> lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
> page_autonuma->pfn_offset_next)) etc..
>
> Of course all list_add/del must be hardcoded specially for this, but
> it's not a conceptually difficult solution, just we can't use list.h
> and straight pointers anymore and some conversion must happen.
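
As a rough illustration of the offset-based links described in the quote
above, here is a minimal userspace sketch. It is not the code from the
patch: the names node_model, page_autonuma_model, OFFSET_NONE and the
model_list_* helpers are hypothetical, the list head is assumed to be a
plain offset, and a flat array stands in for lookup_page_autonuma().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "no neighbour" marker; the real patch may encode this differently. */
#define OFFSET_NONE UINT32_MAX

/* Same layout as the quoted struct: links are pfn offsets, not pointers. */
struct page_autonuma_model {
	short autonuma_last_nid;
	short autonuma_migrate_nid;
	uint32_t pfn_offset_next;
	uint32_t pfn_offset_prev;
} __attribute__((packed));

/* Stand-in for one source node: pa[] is indexed by (pfn - node_start_pfn). */
struct node_model {
	uint64_t node_start_pfn;
	uint32_t head;			/* offset of the first queued page */
	struct page_autonuma_model *pa;
};

/* Queue the page with the given pfn at the head of the list. */
static void model_list_add(struct node_model *n, uint64_t pfn)
{
	uint64_t delta = pfn - n->node_start_pfn;
	uint32_t off;

	assert(delta < OFFSET_NONE);	/* offset must fit in 32 bits */
	off = (uint32_t)delta;

	n->pa[off].pfn_offset_prev = OFFSET_NONE;
	n->pa[off].pfn_offset_next = n->head;
	if (n->head != OFFSET_NONE)
		n->pa[n->head].pfn_offset_prev = off;
	n->head = off;
}

/* Unlink the page with the given pfn from the list. */
static void model_list_del(struct node_model *n, uint64_t pfn)
{
	uint32_t off = (uint32_t)(pfn - n->node_start_pfn);
	uint32_t prev = n->pa[off].pfn_offset_prev;
	uint32_t next = n->pa[off].pfn_offset_next;

	if (prev != OFFSET_NONE)
		n->pa[prev].pfn_offset_next = next;
	else
		n->head = next;
	if (next != OFFSET_NONE)
		n->pa[next].pfn_offset_prev = prev;
	n->pa[off].pfn_offset_next = OFFSET_NONE;
	n->pa[off].pfn_offset_prev = OFFSET_NONE;
}

/* Store the pfn of the next queued page in *next_pfn; return 0 at end of list. */
static int model_list_next(const struct node_model *n, uint64_t pfn,
			   uint64_t *next_pfn)
{
	uint32_t next = n->pa[pfn - n->node_start_pfn].pfn_offset_next;

	if (next == OFFSET_NONE)
		return 0;
	*next_pfn = n->node_start_pfn + next;
	return 1;
}
```

With two 4-byte offsets in place of two 8-byte pointers, the packed
struct is 12 bytes on a 64-bit build with GCC or Clang. The cost is that
every list walk has to go back through node_start_pfn, as in the
lookup_page_autonuma(pfn_to_page(...)) expression quoted above.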

So here is the above idea, implemented and apparently working fine. It
has only been running for half an hour, but all benchmark regression
tests passed with the same score as before, and I verified that memory
moves in all directions during the bench, so there's a good chance
it's ok.

It actually works even if a node has more than 16TB, but in that case
it will WARN_ONCE on the first page that is migrated at an offset
above 16TB from the start of the node, and from then on it simply
skips migrating pages whose offset is too large.
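
The 16TB figure is consistent with a 32-bit pfn offset and 4KB pages:
2^32 pages * 2^12 bytes = 2^44 bytes. The sketch below works through
that arithmetic and the "warn once, then skip" behaviour in userspace C.
It is only an illustration: MODEL_PAGE_SHIFT, model_max_node_bytes() and
model_pfn_to_offset() are hypothetical names, the 4KB page size is an
assumption, and fprintf() stands in for WARN_ONCE.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4KB pages */

/* Largest span, in bytes, that a 32-bit page offset can address. */
static uint64_t model_max_node_bytes(void)
{
	return ((uint64_t)UINT32_MAX + 1) << MODEL_PAGE_SHIFT;
}

/*
 * Convert a pfn into a 32-bit offset from the start of its node.
 * Returns false (page is skipped) if the offset does not fit,
 * printing a warning only the first time that happens.
 */
static bool model_pfn_to_offset(uint64_t node_start_pfn, uint64_t pfn,
				uint32_t *offset)
{
	static bool warned;
	uint64_t delta = pfn - node_start_pfn;

	if (delta > UINT32_MAX) {
		if (!warned) {
			warned = true;
			fprintf(stderr, "pfn offset too large, skipping\n");
		}
		return false;
	}
	*offset = (uint32_t)delta;
	return true;
}
```

Pages below the limit keep working as before; only pages past the
limit are left where they are.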

Next up are the docs for autonuma_balance() at the top of
kernel/sched/numa.c and a cleanup of the autonuma_balance callout
location (if I can figure out how to do an active balance on the
running task from softirq). The callout is where it is at the moment
just so that it's invoked after load_balance runs, so cleaning it up
shouldn't make a runtime difference (hackbench already runs identical
to upstream), but it'll certainly be nice to micro-optimize away a
call and a branch from the schedule() fast path.

After that I'll write Documentation/vm/AutoNUMA.txt and finish the THP
native migration (the latter assuming nobody does it before I get
there; if somebody wants to do it sooner, Johannes and I figured out
the locking details during the MM summit, but it's some work to
implement).

===