Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate

From: Mel Gorman
Date: Thu Nov 15 2012 - 05:27:44 EST


On Wed, Nov 14, 2012 at 01:39:53PM -0600, Andrew Theurer wrote:
> > > <SNIP>
> > >
> > > I am wondering if it would be better to shrink the scan period back to a
> > > much smaller fixed value,
> >
> > I'll do that anyway.
> >
> > > and instead of picking 256MB ranges of memory
> > > to mark completely, go back to using all of the address space, but mark
> > > only every Nth page.
> >
> > It'll still be necessary to do the full walk, and I wonder if we'd
> > lose on the larger number of PTE locks that would have to be taken
> > to complete a scan if we are only updating every 128th page, for
> > example. It could be very expensive.
>
> Yes, good point. My other inclination was not doing a mass marking of
> pages at all (except just one time at some point after task init) and
> conditionally setting or clearing the prot_numa in the fault path itself
> to control the fault rate.

That's a bit of a catch-22: you need faults to control the scan rate,
which in turn determines the fault rate.

One option is to rate-limit the PTE scanning and updating if there is
an excessive number of migrations due to NUMA hinting faults within a
given window. I've prototyped something along these lines. The problem
is that it disrupts the accuracy of the statistics gathered by the
hinting faults.
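
Roughly this shape, written as standalone pseudo-C rather than the
actual patch; every name and threshold below is invented purely for
illustration:

#include <stdbool.h>

#define NUMA_WINDOW_LEN		100	/* window length in ticks; invented */
#define NUMA_MIGRATE_LIMIT	32	/* migrations per window; invented */

struct numa_throttle {
	unsigned long window_start;	/* tick when this window opened */
	unsigned long nr_migrations;	/* migrations charged so far */
};

/* Consulted before a scan pass: skip the pass if this task migrated
 * too many pages in the current window. */
static bool numa_scan_throttled(struct numa_throttle *t, unsigned long now)
{
	if (now - t->window_start > NUMA_WINDOW_LEN) {
		/* Window expired: reset and resume normal scanning */
		t->window_start = now;
		t->nr_migrations = 0;
		return false;
	}
	return t->nr_migrations > NUMA_MIGRATE_LIMIT;
}

The hinting fault path would bump nr_migrations whenever it actually
migrates a page, which is where the statistics disruption comes from.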

> The problem I see is I am not sure how we
> "back-off" the fault rate per page.

I went for a straight cutoff: if a node has seen too many migrations
recently, no PTEs that point to pages on that node are marked for
update. I know it's a heavy hammer, but it will indicate whether the
approach is worthwhile.
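
The cutoff itself is trivial. Sketched in standalone pseudo-C with
invented names and an invented threshold, it amounts to a per-node
budget consulted from the scanner's PTE walk:

#include <stdbool.h>

#define MAX_NUMNODES		64	/* invented for the sketch */
#define NODE_MIGRATE_LIMIT	1024	/* per-node budget; invented */

static unsigned long node_migrations[MAX_NUMNODES];

/* In the scanner's PTE walk: mark a PTE for NUMA hinting only if the
 * node backing the page is still within its recent migration budget. */
static bool should_mark_pte_numa(int page_nid)
{
	return node_migrations[page_nid] <= NODE_MIGRATE_LIMIT;
}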

> You could choose to not leave the
> page marked, but then you never get a fault on that page again, so
> there's no good way to mark it again in the fault path for that page
> unless you have the periodic marker.

In my case, once the throttle window expires, scanning resumes at the
normal rate. I've also changed the details of how the scanning rate
increases and decreases, but the specifics are not that important right
now.
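
For the curious, the general shape is a scan period clamped between a
floor and a ceiling, grown or shrunk depending on whether recent faults
were useful. Purely illustrative, not the patch; the constants are
invented:

#define SCAN_PERIOD_MIN		100	/* ms; invented floor */
#define SCAN_PERIOD_MAX		6400	/* ms; invented ceiling */

/* Shrink the period (scan sooner) when recent hinting faults were
 * useful, grow it (back off) when they were not, clamped either way. */
static unsigned int update_scan_period(unsigned int period, int useful)
{
	period = useful ? period / 2 : period * 2;

	if (period < SCAN_PERIOD_MIN)
		period = SCAN_PERIOD_MIN;
	if (period > SCAN_PERIOD_MAX)
		period = SCAN_PERIOD_MAX;
	return period;
}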

> However, maybe a certain number of
> pages are considered clustered together, and a fault from any page is
> considered a fault for the cluster of pages. When handling the fault,
> the number of pages which are marked in the cluster is varied to achieve
> a target, reasonable fault rate. Might be able to treat page migrations
> in clusters as well... I probably need to think about this a bit
> more....
>

FWIW, I'm wary of putting too many smarts into how the scanning rates are
adapted. It'll be too specific to workloads and machine sizes.
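
To illustrate why: even a minimal version of the clustering idea means
a second adaptive feedback loop on top of the scan period. Sketched in
standalone pseudo-C with invented names and targets:

#define CLUSTER_PAGES		64	/* pages accounted as one cluster */
#define TARGET_FAULTS		8	/* faults per cluster per window */

struct pte_cluster {
	unsigned int nr_faults;		/* hinting faults this window */
	unsigned int nr_marked;		/* PTEs to mark on the next pass */
};

/* At window end, chase the target fault rate by varying how many of
 * the cluster's PTEs get remarked. */
static void cluster_window_expired(struct pte_cluster *c)
{
	if (c->nr_faults > TARGET_FAULTS && c->nr_marked > 1) {
		c->nr_marked /= 2;		/* too hot: mark fewer */
	} else if (c->nr_faults < TARGET_FAULTS) {
		c->nr_marked *= 2;		/* too cold: mark more */
		if (c->nr_marked > CLUSTER_PAGES)
			c->nr_marked = CLUSTER_PAGES;
	}
	c->nr_faults = 0;
}

That's exactly the sort of extra tuning surface I'd like to avoid until
the simple approach has been shown wanting.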

--
Mel Gorman
SUSE Labs