Re: [PATCH V9 1/4] perf/core: Add PERF_SAMPLE_DATA_PAGE_SIZE

From: Matthew Wilcox
Date: Wed Nov 11 2020 - 20:31:39 EST


On Wed, Nov 11, 2020 at 09:00:00PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 11, 2020 at 06:26:20PM +0000, Matthew Wilcox wrote:
> > On Wed, Nov 11, 2020 at 06:22:53PM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 11, 2020 at 04:38:48PM +0000, Matthew Wilcox wrote:
> > > > 	if (pud_leaf(pud))
> > > > 		return PUD_SIZE;
> > >
> > > But that doesn't handle non-pagetable aligned hugetlb sizes. Granted,
> > > that's unlikely at the PUD level, but why be inconsistent..
> > >
> > > So we really want:
> > >
> > > 	if (p*d_leaf(p*d)) {
> > > 		if (!'special') {
> > > 			page = p*d_page(p*d);
> > > 			if (PageHuge(page))
> > > 				return page_size(compound_head(page));
> > > 		}
> > > 		return P*D_SIZE;
> > > 	}
> >
> > Still doesn't work because pages can be mapped at funny offsets.
>
> Wait, what?! Is there hardware that has unaligned TLB page-sizes?

No, you can force a 2MB page to be mapped at an address which isn't
2MB aligned.

> Can you start a 64K page at an 8k offset? I don't think I've ever seen
> that. Still even with that, how would the above go wrong there? It would
> find the compound page covering @addr, PageHuge() (and possibly some
> additional arch-specific condition) returns true and we get the compound
> size to find the hardware page size used.

On any architecture I can think of, that 2MB page will be mapped with 4kB
TLB entries.
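
To make the alignment point concrete, here is a minimal userspace sketch
(not kernel code). The helper names can_map_as() and mapping_size() are
hypothetical, and it assumes only two candidate sizes, 4kB and 2MB. A
larger entry is usable only when both the virtual and the physical
address are aligned to that size; otherwise the range has to fall back
to 4kB entries even though the memory behind it is one 2MB page.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)

/*
 * Hypothetical helper, for illustration only: a mapping of 'size' bytes
 * (a power of two) is possible only if both the virtual and the physical
 * address are aligned to 'size'.
 */
static int can_map_as(uint64_t vaddr, uint64_t paddr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return ((vaddr | paddr) & (size - 1)) == 0;
}

/* Pick the largest of the two assumed sizes that the alignment allows. */
static uint64_t mapping_size(uint64_t vaddr, uint64_t paddr)
{
	return can_map_as(vaddr, paddr, SZ_2M) ? SZ_2M : SZ_4K;
}
```

With a 2MB-aligned physical page, mapping_size(0x200000, 0x40000000)
gives 2MB, while shifting the virtual address by one 4kB page, as in
mapping_size(0x201000, 0x40000000), gives 4kB.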

> > What we really want is for a weak definition of
> >
> > unsigned long tlb_size(struct mm_struct *mm, unsigned long addr)
> > {
> > 	if (p*d_leaf(p*d))
> > 		return p*d_size(p*d);
> > }
> >
> > then ARM can look at its special bit in the page table to determine
> > whether this is a singleton or part of a brace of pages.
>
> That's basically what we provide, but really the only thing that's
> missing from this generic page walker is the ability to detect if a
> !PageHuge compound page is actually still a hardware page.
>
> > > Now, when you add !PMD THP sizes (presumably for architectures that have
> > > 'funny' sizes, otherwise what's the point), then you get to add '||
> >
> > This is the problem with all the huge page support in Linux today.
> > It's written by people who work for hardware companies who think only
> > about exploiting the hardware features they sell. You all ignore the
> > very real software overheads of trying to manage millions of pages.
> > I see a 6% reduction in kernel overhead when running kernbench using
> > THPs that may go as large as 256kB. On x86. Intel x86, at that.
>
> That's a really nice improvement. However then this code doesn't care
> about it. Please make it possible to distinguish between THP on hardware
> pages vs software pages.

That can and should be done just by looking at the page table entries.
There's no need to convert it into a struct page. The CPU obviously
decides what TLB entry size to use based solely on the page tables,
so we can too.
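
As a rough illustration of what "looking at the page table entries"
means, here is a toy userspace model, not the proposed kernel
implementation. It assumes an x86-64-style layout with 4kB base pages,
where bit 0 of an entry marks it present and bit 7 (the PS bit) marks a
PUD or PMD entry as a leaf. struct toy_walk and toy_tlb_size() are
hypothetical names; a real walker would read the entries from the mm's
page tables, and an architecture with a contiguous hint would check
that bit as well.

```c
#include <stdint.h>

#define ENTRY_PRESENT	(1ULL << 0)	/* entry is valid */
#define ENTRY_LEAF	(1ULL << 7)	/* x86-64 PS bit: maps a page directly */

/* The entries a walk for one address would have read (toy model). */
struct toy_walk {
	uint64_t pud;
	uint64_t pmd;
	uint64_t pte;
};

/*
 * Return the size of the translation covering the address, judged only
 * from the entries themselves; 0 means nothing is mapped. No struct page
 * is consulted at any point.
 */
static uint64_t toy_tlb_size(const struct toy_walk *w)
{
	if (!(w->pud & ENTRY_PRESENT))
		return 0;
	if (w->pud & ENTRY_LEAF)
		return 1ULL << 30;	/* 1GB leaf at the PUD level */

	if (!(w->pmd & ENTRY_PRESENT))
		return 0;
	if (w->pmd & ENTRY_LEAF)
		return 1ULL << 21;	/* 2MB leaf at the PMD level */

	if (!(w->pte & ENTRY_PRESENT))
		return 0;
	return 1ULL << 12;		/* ordinary 4kB page */
}
```

In this model a 2MB compound page mapped at an unaligned address shows
up as a run of present PTEs, so the function reports 4kB for it, which
matches what the hardware would do.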