RE: mm/DAMON: Profiling enhancements for DAMON

From: Prasad, Aravinda
Date: Mon Dec 18 2023 - 06:32:54 EST




> -----Original Message-----
> From: Yu Zhao <yuzhao@xxxxxxxxxx>
> Sent: Saturday, December 16, 2023 11:12 AM
> To: Prasad, Aravinda <aravinda.prasad@xxxxxxxxx>
> Cc: damon@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; sj@xxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; s2322819@xxxxxxxx; Kumar, Sandeep4
> <sandeep4.kumar@xxxxxxxxx>; Huang, Ying <ying.huang@xxxxxxxxx>; Hansen,
> Dave <dave.hansen@xxxxxxxxx>; Williams, Dan J <dan.j.williams@xxxxxxxxx>;
> Subramoney, Sreenivas <sreenivas.subramoney@xxxxxxxxx>; Kervinen, Antti
> <antti.kervinen@xxxxxxxxx>; Kanevskiy, Alexander
> <alexander.kanevskiy@xxxxxxxxx>; Alan Nair <alan.nair@xxxxxxxxx>; Juergen
> Gross <jgross@xxxxxxxx>; Ryan Roberts <ryan.roberts@xxxxxxx>
> Subject: Re: mm/DAMON: Profiling enhancements for DAMON
>
> On Fri, Dec 15, 2023 at 3:08 AM Prasad, Aravinda <aravinda.prasad@xxxxxxxxx>
> wrote:
> >
> > > On Fri, Dec 15, 2023 at 12:42 AM Aravinda Prasad
> > > <aravinda.prasad@xxxxxxxxx> wrote:
> > > ...
> > >
> > > > This patch proposes profiling different levels of the
> > > > application’s page table tree to detect whether a region is
> > > > accessed or not. This patch is based on the observation that, when
> > > > the accessed bit for a page is set, the accessed bits at the
> > > > higher levels of the page table tree (PMD/PUD/PGD) corresponding
> > > > to the path of the page table walk are also set. Hence, it is
> > > > efficient to check the accessed bits at the higher levels of the
> > > > page table tree to detect whether a region is accessed or not.
> > >
> > > This patch can crash on Xen. See commit 4aaf269c768d("mm: introduce
> > > arch_has_hw_nonleaf_pmd_young()")
> >
> > Will fix as suggested in the commit.
> >
> > >
> > > MGLRU already does this in the correct way. See mm/vmscan.c.

Noted.

> >
> > I don't see access bits at PUD or PGD checked for 4K page size. Can
> > you point me to the code where access bits are checked at PUD and PGD level?
>
> There isn't any, because *the system* bottlenecks at the PTE level and at moving
> memory between tiers. Optimizing at the PUD/PGD levels has insignificant ROI
> for the system.

Optimizing at the PUD/PGD levels is useful for large-footprint applications,
especially for DAMON, which needs to find out whether any page in a region has
been accessed.
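
To illustrate what this buys for a region-based profiler, here is a minimal
user-space sketch in C. It is a toy model, not kernel or DAMON code: the toy_*
names, the two-level layout and the sizes are all assumptions made for
illustration. It relies only on the observation from the patch description
that an access sets the accessed bit at every level of the page table walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (assumption): one "PMD" entry covers 512 "PTE" entries. */
#define TOY_PTES_PER_PMD 512
#define TOY_NR_PMDS 8

struct toy_table {
	bool pmd_young[TOY_NR_PMDS];
	bool pte_young[TOY_NR_PMDS][TOY_PTES_PER_PMD];
};

/* Mimic the hardware walker: an access sets the A-bit on each level of the path. */
static void toy_touch(struct toy_table *t, size_t page)
{
	size_t pmd = page / TOY_PTES_PER_PMD;

	assert(pmd < TOY_NR_PMDS);
	t->pmd_young[pmd] = true;
	t->pte_young[pmd][page % TOY_PTES_PER_PMD] = true;
}

/* Leaf-level check: inspects PTE entries until one young entry is found. */
static bool toy_young_leaf(const struct toy_table *t, size_t pmd, size_t *checked)
{
	assert(pmd < TOY_NR_PMDS);
	for (size_t i = 0; i < TOY_PTES_PER_PMD; i++) {
		(*checked)++;
		if (t->pte_young[pmd][i])
			return true;
	}
	return false;
}

/* Higher-level check: inspects exactly one entry for the same range. */
static bool toy_young_nonleaf(const struct toy_table *t, size_t pmd, size_t *checked)
{
	assert(pmd < TOY_NR_PMDS);
	(*checked)++;
	return t->pmd_young[pmd];
}
```

In this model, deciding that an untouched 2M range is idle costs 512 leaf
checks but only one higher-level check. The same reasoning would extend to the
PUD and PGD levels.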

>
> And food for thought:
> 1. Can a PUD/PGD cover memory from different tiers?

Yes, it can.

> 2. Can the A-bit in non-leaf entries work for EPT?

Need to check.

>
> > > This patch also can cause USER DATA CORRUPTION. See commit
> > > c11d34fa139e ("mm/damon/ops-common: atomically test and clear young
> > > on ptes and pmds").
> >
> > Ok. Will atomically test and set the access bits.
> >
> > >
> > > The quality of your patch makes me very much doubt the quality of
> > > your paper, especially your results on Google's kstaled and MGLRU in
> > > table 6.2.
> >
> > The results are very much reproducible. We have not used kstaled/MGLRU
> > for the data in Figure 3, but we linearly scan pages similar to
> > kstaled by implementing a kernel thread for scanning.
>
> You have not used MGLRU, and yet your results are very much reproducible.

As mentioned in the paper, the results are for scanning the accessed bits at
the leaf level (PTE for 4K pages and PMD for 2M pages). In general, this applies
to any technique that scans at the leaf level: for large-footprint applications,
the scanning time increases drastically.

MGLRU also scans leaf-level accessed bits and hence falls into this category.

Similar observations on scanning were also made by HeMem [2] in Figure 3.

[2] "HeMem: Scalable Tiered Memory Management for Big Data Applications
and Real NVM", https://dl.acm.org/doi/pdf/10.1145/3477132.3483550
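
For a rough sense of scale, the number of entries needed to cover a footprint
at each level can be computed as in the sketch below. This is only arithmetic,
not a measurement: it assumes x86-64 4-level paging with 4K base pages and a
contiguous, aligned mapping, and the TOY_* names and entries_to_cover() are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: x86-64 4-level paging with 4K base pages.
 * One entry at a level maps (1 << shift) bytes.
 */
enum {
	TOY_PTE_SHIFT = 12,	/* 4 KiB */
	TOY_PMD_SHIFT = 21,	/* 2 MiB */
	TOY_PUD_SHIFT = 30,	/* 1 GiB */
	TOY_PGD_SHIFT = 39,	/* 512 GiB */
};

/* Entries at one level needed to cover `bytes` of aligned address space. */
static uint64_t entries_to_cover(uint64_t bytes, unsigned int shift)
{
	assert(shift < 64);

	uint64_t span = UINT64_C(1) << shift;

	/* Round up so that a partially covered entry is still counted. */
	return bytes / span + (bytes % span != 0);
}
```

Under these assumptions, a 1 TiB footprint corresponds to 268,435,456 PTEs,
524,288 PMD entries, 1,024 PUD entries and 2 PGD entries.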

>
> > Our argument for kstaled/MGLRU is that, scanning individual pages at
> > 4K granularity may not be efficient for large footprint applications.
>
> Your argument for MGLRU is based on a wrong assumption, as I have already
> pointed out.

Our argument in the paper applies to any technique that scans leaf-level
accessed bits, be it kstaled or MGLRU.

>
> > Instead,
> > access bits at the higher level of the page table tree can be used. In
> > the paper we have demonstrated this with DAMON but the concept can be
> > applied to kstaled/MGLRU as well.
>
> You got it backward: MGLRU introduced the concept; you fabricated a comparison
> table.

I am not convinced. The documentation does mention "clearing the accessed
bit in non-leaf page table entries" for the 0x0004 bit in
/sys/kernel/mm/lru_gen/enabled, but the code is restricted to the PMD level only:

static bool should_clear_pmd_young(void)
{
	return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}
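
For reference, a small user-space helper to test that 0x0004 bit could look
like the sketch below. The function and macro names are hypothetical. It only
decodes a mask string such as the one read from /sys/kernel/mm/lru_gen/enabled;
it says nothing about which page table levels the kernel actually clears.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Bit documented as "clearing the accessed bit in non-leaf page table entries". */
#define TOY_NONLEAF_YOUNG_MASK 0x0004UL

/* Return true if the hex mask string (e.g. "0x0007\n") has the 0x0004 bit set. */
static bool toy_nonleaf_young_enabled(const char *enabled_str)
{
	assert(enabled_str != NULL);

	/* Base 16 accepts an optional "0x" prefix; parsing stops at the newline. */
	unsigned long mask = strtoul(enabled_str, NULL, 16);

	return (mask & TOY_NONLEAF_YOUNG_MASK) != 0;
}
```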

Regards,
Aravinda