Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs

From: Mel Gorman
Date: Thu Dec 19 2013 - 09:55:24 EST

In reply to: Mel Gorman: "Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs"

On Thu, Dec 12, 2013 at 12:00:37PM -0600, Alex Thorlton wrote:
> This patch changes the way we decide whether or not to give out THPs to
> processes when they fault in pages. The way things are right now,
> touching one byte in a 2M chunk where no pages have been faulted in
> results in a process being handed a 2M hugepage, which, in some cases,
> is undesirable. The most common issue seems to arise when a process
> uses many cores to work on small portions of an allocated chunk of
> memory.
>
> <SNIP>
>
> As you can see there's a significant performance increase when running
> this test with THP off. Here's a pointer to the test, for those who are
> interested:
>
> http://oss.sgi.com/projects/memtests/thp_pthread.tar.gz
>
> My proposed solution to the problem is to allow users to set a
> threshold at which THPs will be handed out. The idea here is that, when
> a user faults in a page in an area where they would usually be handed a
> THP, we pull 512 pages off the free list, as we would with a regular
> THP, but we only fault in single pages from that chunk, until the user
> has faulted in enough pages to pass the threshold we've set.

I have not read this thread yet, so this is just my initial reaction to
this part alone.

First, you say that the proposed solution is to allow users to set a
threshold at which THPs will be handed out, but you actually allocate all
the pages up front, so it's not just that. There are a few things in play:

1. Deferred zeroing cost
2. Deferred THP setup cost
3. Different TLB pressure
4. Alignment issues and NUMA

All are important. It is common for there to be fewer large TLB entries
than small ones. Workloads that sparsely reference data may suffer badly
when using large pages as the TLB gets trashed. Your workload could be
specifically testing for the TLB pressure (optimising point 3 above) in
which case the processor used for benchmarking is a major factor and it's
not a universal win.
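
To make the TLB point concrete, here is a toy model (a sketch only, not a
measurement). It treats the TLB as a fully associative LRU cache and counts
misses for a workload that touches one byte in each of 32 different 2M
regions, over and over. The entry counts (64 small, 8 large) are
hypothetical numbers chosen for illustration; real TLBs are set-associative,
multi-level and differ between processors.

```python
from collections import OrderedDict

SMALL_PAGE = 4096              # 4K base page
LARGE_PAGE = 2 * 1024 * 1024   # 2M huge page


def tlb_misses(addresses, page_size, entries):
    """Count misses in a fully associative LRU TLB (toy model)."""
    tlb = OrderedDict()  # virtual page number -> present, oldest first
    misses = 0
    for addr in addresses:
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)         # hit: mark as most recently used
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used


    return misses


# Sparse workload: one byte in each of 32 distinct 2M regions, 10 passes.
addresses = [region * LARGE_PAGE for region in range(32)] * 10

# Hypothetical TLB sizes, assumed for illustration only.
small_misses = tlb_misses(addresses, SMALL_PAGE, entries=64)
large_misses = tlb_misses(addresses, LARGE_PAGE, entries=8)

print(small_misses)  # 32: only the cold misses, the working set fits
print(large_misses)  # 320: every access misses, 32 regions cycle through 8 entries
```

With these assumed sizes the sparse workload fits in the small-page TLB but
thrashes the large-page one. A dense workload would show the opposite, which
is why the result depends on the processor and the access pattern.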

For example, your workload may optimise 3 but other workloads may suffer
because more faults are incurred until the threshold is reached, the
page tables must be walked to initialise the remaining pages, and then the
THP must be set up and the TLB flushed.

Keep these details in mind when measuring your patches if at all possible.

Otherwise, on the face of it this is actually a similar proposal to "page
reservation" described in one of the more important large page papers written
by Talluri (http://dl.acm.org/citation.cfm?id=195531). Right now you could
consider Linux to be reserving pages with a promotion threshold of 1 and
you're aiming to alter that threshold. Seems like a reasonable idea that
will eventually work out even though I have not seen the implementation yet.
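
The reservation-with-threshold idea can be sketched as a toy model of a
single 2M-aligned region. This is not the patch (which I have not seen); the
function name and the accounting are assumptions made for illustration. It
assumes the 512 pages are already reserved, that pages are zeroed only when
they are mapped, and that promotion zeroes whatever is left in one go.

```python
PAGES_PER_THP = 512  # 2M huge page / 4K base pages


def simulate_region(touched, threshold):
    """Toy model: faults and zeroing for one reserved 2M region.

    touched   -- base-page indices (0..511) in access order
    threshold -- number of faulted pages at which the region is promoted
    """
    mapped = set()    # base pages mapped individually so far
    faults = 0
    pages_zeroed = 0
    promoted = False
    for page in touched:
        if not 0 <= page < PAGES_PER_THP:
            raise ValueError("page index outside the 2M region")
        if promoted or page in mapped:
            continue  # already mapped, no fault taken
        faults += 1
        if len(mapped) + 1 >= threshold:
            # Threshold reached: zero every page not yet mapped
            # (including this one) and map the region as a huge page.
            pages_zeroed += PAGES_PER_THP - len(mapped)
            promoted = True
        else:
            pages_zeroed += 1
            mapped.add(page)
    return {"faults": faults, "pages_zeroed": pages_zeroed,
            "promoted": promoted}


# Threshold of 1 models the current behaviour: first touch gets a THP.
print(simulate_region(range(PAGES_PER_THP), threshold=1))
# Dense access with a threshold of 64: more faults before promotion.
print(simulate_region(range(PAGES_PER_THP), threshold=64))
# Sparse access with a threshold of 64: never promoted, little zeroing.
print(simulate_region([0, 64, 128, 192, 0], threshold=64))
```

In this model a dense workload pays 64 faults instead of 1 for the same 512
zeroed pages, while a sparse one zeroes 4 pages instead of 512. It ignores
TLB effects, NUMA placement and the cost of the promotion itself, all of
which matter when measuring the real thing.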

--
Mel Gorman
SUSE Labs
