Re: [PATCH] numa: fix /proc/<pid>/numa_maps on s390

From: Gerald Schaefer
Date: Fri Jan 22 2016 - 08:37:19 EST


On Thu, 21 Jan 2016 11:19:19 -0800
Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, 21 Jan 2016 15:57:32 +0100 Gerald Schaefer <gerald.schaefer@xxxxxxxxxx> wrote:
>
> > > --- a/fs/proc/task_mmu.c~numa-fix-proc-pid-numa_maps-on-s390-fix
> > > +++ a/fs/proc/task_mmu.c
> > > @@ -1523,6 +1523,7 @@ static int gather_pte_stats(pmd_t *pmd,
> > > pte_t *pte;
> > >
> > > ptl = pmd_trans_huge_lock(pmd, vma);
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > if (ptl) {
> > > pte_t huge_pte = huge_ptep_get((pte_t *)pmd);
> > > struct page *page;
> > > @@ -1534,6 +1535,7 @@ static int gather_pte_stats(pmd_t *pmd,
> > > spin_unlock(ptl);
> > > return 0;
> > > }
> > > +#endif
> > >
> > > if (pmd_trans_unstable(pmd))
> > > return 0;
> >
> > Hi Andrew,
> >
> > Unfortunately this seems to be a lot more complicated than we thought.
> > huge_ptep_get() is only defined when CONFIG_HUGETLB_PAGE=y, independent
> > from CONFIG_TRANSPARENT_HUGEPAGE. This is because asm/hugetlb.h is only
> > included from linux/hugetlb.h when CONFIG_HUGETLB_PAGE=y.
> >
> > So this fix won't fix the build error when CONFIG_HUGETLBFS=n and
> > CONFIG_TRANSPARENT_HUGEPAGE=y.
> >
> > Since the THP code did not repeat the flaws of the hugtelbfs code, i.e.
> > it is actually working on PMD entries and not PTE entries, there was
> > no need for huge_ptep_get() for THP so far.
> >
> > Now it seems that the THP code in gather_pte_stats() is an exception to
> > this, as it is not working on a PMD like the rest of the THP code, but
> > also on a fake "PTE" like the hugetlbfs code.
> >
> > I guess this needs more thinking, two options are crossing my mind:
> > - Fix the THP code in gather_pte_stats() to properly use a PMD instead of
> > PTE. This would probably require something like a "_pmd" version of
> > "can_gather_numa_stats()" and a pmd_dirty() check for the
> > gather_stats() parameter.
> > - Make huge_ptep_get() also available for CONFIG_HUGETLBFS=n, perhaps
> > by introducing something like HAVE_ARCH_HUGE_PTEP_GET and implementing
> > the default NOP version in linux/hugetlbfs.h instead of the individual
> > asm/hugetlbfs.h files for all archs.
> >
> > The first option seems more correct, but it might entail other problems.
> > The second option would also introduce new problems on s390, where the
> > implementation of huge_ptep_get() in arch/s390/mm/hugetlbpage.c is currently
> > only built with CONFIG_HUGETLBFS=y, but I guess we could handle that.
> >
> > Any thoughts / more ideas?
> >
>
> The first option does of course sound better.
>
> But you need numa_maps fixed, presumably in 4.5 and possibly backported
> into -stable? (The changelog doesn't describe the end-user-visible
> effects of the bug. Naughty changelog!)
>
> So is there some minimal thing we can do for now to get things working
> properly and fix it for-real in 4.6?
>

Right, the effects are worth a closer look. NUMA accounting is affected
for THP and hugetlbfs mappings on s390, and only the latter can be fixed
with huge_ptep_get() as in Michaels patch.

On current kernel levels, at least since we have NUMA on s390, both
accountings should actually work by chance for most pages, apart from
PROT_NONE PMDs where pte_present() will return 0. Dirty page accounting is
also broken, as pte_dirty() will always return 0 for PMDs.

Initially, I also had concerns about a potentially faulty memory access,
caused by falsely calculated pfn values and struct page addresses, but it
looks like we can only get "offsets" within one large page frame, where
all struct pages should exist.

So, the minimal thing to do for now would probably be to fix only
gather_hugetlb_stats(), i.e. take only the second hunk of Michaels patch.
I guess it would be best if we send a new patch, maybe even with a better
description, and revert what has been added so far.