Re: [this_cpu_xx V4 13/20] this_cpu_ops: page allocator conversion

From: Mel Gorman
Date: Mon Oct 05 2009 - 05:46:44 EST


On Fri, Oct 02, 2009 at 01:39:28PM -0400, Christoph Lameter wrote:
> On Fri, 2 Oct 2009, Mel Gorman wrote:
>
> > On Thu, Oct 01, 2009 at 05:25:34PM -0400, cl@xxxxxxxxxxxxxxxxxxxx wrote:
> > > Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.
> > >
> > > This drastically reduces the size of struct zone for systems with large
> > > amounts of processors and allows placement of critical variables of struct
> > > zone in one cacheline even on very large systems.
> > >
> >
> > This seems reasonably accurate. The largest shrink is on !NUMA configured
> > systems but the NUMA case deletes a lot of pointers.
>
> True, the !NUMA case will then avoid allocating pagesets for unused
> zones. But the NUMA case will have the most benefit since the large arrays
> in struct zone are gone.

Indeed. Out of curiosity, has this patchset been performance tested? I
would expect a small but measurable improvement. If there is a
regression instead, it might point to poor placement of read/write
fields in struct zone.
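
To make the shrink concrete, my rough mental model of the change is the
following (a sketch from memory, not the actual hunks):

	/* Before the conversion, struct zone carried per-CPU state
	 * inline (or, on NUMA, an array of pointers), both sized by
	 * NR_CPUS and so growing with the number of processors:
	 */
	struct zone {
		...
		struct per_cpu_pageset	pageset[NR_CPUS];	/* !NUMA */
		/* struct per_cpu_pageset *pageset[NR_CPUS];	   NUMA  */
		...
	};

	/* After the conversion a single percpu pointer remains, and
	 * the per-CPU copies live in the percpu area, grouped by CPU
	 * rather than by zone:
	 */
	struct zone {
		...
		struct per_cpu_pageset	*pageset;
		...
	};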

> Removing the pagesets from struct zone also
> increases the cacheability of struct zone information. This is
> particularly useful since the size of the pagesets grew with the addition
> of the various types of allocation queues.
>
> > > Another effect is that the pagesets of one processor are placed near one
> > > another. If multiple pagesets from different zones fit into one cacheline
> > > then additional cacheline fetches can be avoided on the hot paths when
> > > allocating memory from multiple zones.
> > >
> >
> > Out of curiosity, how common an occurrence is it for a CPU to allocate
> > from multiple zones? I would have thought it was rare but I never
> > checked either.
>
> Zone allocations are determined by their use. GFP_KERNEL allocs come from
> ZONE_NORMAL whereas typical application pages may come from ZONE_HIGHMEM.
> The mix depends on what the kernel and the application are doing.
>

I just wouldn't have expected a significant enough mix to make a measurable
performance difference. It's no biggie.

> > > pcp = &pset->pcp;
> > > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > migratetype = get_pageblock_migratetype(page);
> > > set_page_private(page, migratetype);
> > > local_irq_save(flags);
> > > @@ -1133,7 +1133,6 @@ static void free_hot_cold_page(struct pa
> > >
> > > out:
> > > local_irq_restore(flags);
> > > - put_cpu();
> >
> > Previously we get_cpu() to be preemption safe. We then disable
> > interrupts and potentially take a spinlock later.
>
> Right. We need to move the local_irq_save() up two lines.

Just so I'm 100% clear, is disabling IRQs considered sufficient to
prevent preemption here?

> Why disable
> preemption and two instructions later disable interrupts? Isn't this
> bloating the code?
>

By and large, IRQs are disabled at the last possible moment, with the
minimum amount of code in between. While the current placement does not
make perfect sense, it probably arrived here through many small changes,
each author avoiding keeping IRQs disabled for too long without
considering the cost of get_cpu().

Similar care needs to be taken with the other removals of get_cpu() in
this patch to ensure it's still preemption-safe.

> > this_cpu_ptr() looks up PCP
> > disable interrupts
> > enable interrupts
>
> Move disable interrupts before the this_cpu_ptr?
>

In this case, why not move this_cpu_ptr() down to its first use, just
before the if (cold) check? It would still be within the IRQ-disabled
section, but without significantly increasing the time interrupts are
disabled.
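
Concretely, the ordering I have in mind would look something like this
(a sketch against my reading of the patch, details elided, untested):

	static void free_hot_cold_page(struct page *page, int cold)
	{
		struct zone *zone = page_zone(page);
		struct per_cpu_pages *pcp;
		unsigned long flags;
		int migratetype;

		migratetype = get_pageblock_migratetype(page);
		set_page_private(page, migratetype);

		/* Disabling interrupts also pins us to this CPU, so
		 * the old get_cpu()/put_cpu() pair becomes unnecessary... */
		local_irq_save(flags);

		/* ...and the percpu lookup is deferred to just before
		 * its first use, inside the IRQ-disabled section. */
		pcp = &this_cpu_ptr(zone->pageset)->pcp;
		if (cold)
			list_add_tail(&page->lru, &pcp->list);
		else
			list_add(&page->lru, &pcp->list);
		pcp->count++;

		local_irq_restore(flags);
	}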

> > > +/*
> > > + * Allocate per cpu pagesets and initialize them.
> > > + * Before this call only boot pagesets were available.
> > > + * Boot pagesets will no longer be used after this call is complete.
> >
> > If they are no longer used, do we get the memory back?
>
> No we need to keep them for onlining new processors.
>

That comment would appear to disagree.

> > > - * as it comes online
> > > + for_each_possible_cpu(cpu) {
> > > + struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> > > +
> > > + setup_pageset(pcp, zone_batchsize(zone));
> > > +
> > > + if (percpu_pagelist_fraction)
> > > + setup_pagelist_highmark(pcp,
> > > + (zone->present_pages /
> > > + percpu_pagelist_fraction));
> > > + }
> > > + }
> >
> > This would have been easier to review if you left process_zones() where it
> > was and converted it to the new API. I'm assuming this is just shuffling
> > code around.
>
> Yes I think this was the result of reducing #ifdefs.
>

Ok.
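
For the record, the shape I'd expect the shuffled code to take is
roughly the following (simplified from the quoted hunk; the function
name here is mine, not necessarily the patch's):

	static void setup_zone_pageset(struct zone *zone)
	{
		int cpu;

		/* One allocation covers every possible CPU; the copies
		 * are laid out per-CPU in the percpu area rather than
		 * per-zone, so new CPUs need no further allocation. */
		zone->pageset = alloc_percpu(struct per_cpu_pageset);

		for_each_possible_cpu(cpu) {
			struct per_cpu_pageset *pcp =
					per_cpu_ptr(zone->pageset, cpu);

			setup_pageset(pcp, zone_batchsize(zone));

			if (percpu_pagelist_fraction)
				setup_pagelist_highmark(pcp,
					zone->present_pages /
						percpu_pagelist_fraction);
		}
	}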

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab