Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

From: Peter Zijlstra
Date: Wed Nov 11 2020 - 15:00:33 EST


On Wed, Nov 11, 2020 at 06:26:20PM +0000, Matthew Wilcox wrote:
> On Wed, Nov 11, 2020 at 06:22:53PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 11, 2020 at 04:38:48PM +0000, Matthew Wilcox wrote:
> > > 	if (pud_leaf(pud))
> > > 		return PUD_SIZE;
> >
> > But that doesn't handle non-pagetable aligned hugetlb sizes. Granted,
> > that's unlikely at the PUD level, but why be inconsistent..
> >
> > So we really want:
> >
> > 	if (p*d_leaf(p*d)) {
> > 		if (!'special') {
> > 			page = p*d_page(p*d);
> > 			if (PageHuge(page))
> > 				return page_size(compound_head(page));
> > 		}
> > 		return P*D_SIZE;
> > 	}
>
> Still doesn't work because pages can be mapped at funny offsets.

Wait, what?! Is there hardware that has unaligned TLB page-sizes?

Can you start a 64K page at an 8K offset? I don't think I've ever seen
that. Still, even with that, how would the above go wrong there? It would
find the compound page covering @addr, PageHuge() (and possibly some
additional arch-specific condition) returns true, and we get the compound
size to find the hardware page size used.
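
To make that reasoning concrete, here is a rough userspace model of the
leaf check, not kernel code. Everything named model_* (and the MODEL_*
sizes) is made up for illustration: 'huge' stands in for PageHuge(),
'compound_size' for page_size(compound_head(page)), and 'special' for
an entry with no struct page behind it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes for the model; real values are per-architecture. */
#define MODEL_PMD_SIZE	(1UL << 21)
#define MODEL_PUD_SIZE	(1UL << 30)

/* Stand-in for struct page; both fields are assumptions of this sketch. */
struct model_page {
	bool huge;			/* models PageHuge(page) */
	unsigned long compound_size;	/* models page_size(compound_head(page)) */
};

/* Stand-in for a p*d entry at some page-table level. */
struct model_entry {
	bool leaf;			/* models p*d_leaf(p*d) */
	bool special;			/* no struct page to look at */
	const struct model_page *page;	/* models p*d_page(p*d) */
};

/*
 * Return the size to report for a leaf entry: the compound size for a
 * hugetlb page, otherwise the size of the page-table level itself.
 */
unsigned long model_leaf_size(const struct model_entry *e,
			      unsigned long level_size)
{
	assert(e->leaf);

	if (!e->special && e->page && e->page->huge)
		return e->page->compound_size;

	return level_size;
}
```

With a 32M hugetlb page found through a PMD-level leaf, the model
reports 32M rather than MODEL_PMD_SIZE; a special entry, or a compound
page that is not hugetlb, falls back to the level size.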

> What we really want is for a weak definition of
>
> unsigned long tlb_size(struct mm_struct *mm, unsigned long addr)
> {
> 	if (p*d_leaf(p*d))
> 		return p*d_size(p*d);
> }
>
> then ARM can look at its special bit in the page table to determine
> whether this is a singleton or part of a brace of pages.

That's basically what we provide, but really the only thing that's
missing from this generic page walker is the ability to detect if a
!PageHuge compound page is actually still a hardware page.

> > Now, when you add !PMD THP sizes (presumably for architectures that have
> > 'funny' sizes, otherwise what's the point), then you get to add '||
>
> This is the problem with all the huge page support in Linux today.
> It's written by people who work for hardware companies who think only
> about exploiting the hardware features they sell. You all ignore the
> very real software overheads of trying to manage millions of pages.
> I see a 6% reduction in kernel overhead when running kernbench using
> THPs that may go as large as 256kB. On x86. Intel x86, at that.

That's a really nice improvement. However, this code doesn't care
about it. Please make it possible to distinguish between THP on hardware
pages vs software pages.