Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans_unstable() to retry

From: Yang Shi
Date: Tue Jun 06 2023 - 15:13:45 EST


On Mon, Jun 5, 2023 at 12:20 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> > On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > For most of the page walk paths, logically it'll always be good to have the
> > > pmd retries if hit pmd_trans_unstable() race. We can treat it as none
> > > pmd (per comment above pmd_trans_unstable()), but in most cases we're not
> > > even treating that as a none pmd. If to fix it anyway, a retry will be the
> > > most accurate.
> > >
> > > I've went over all the pmd_trans_unstable() special cases and this patch
> > > should cover all the rest places where we should retry properly with
> > > unstable pmd. With the newly introduced ACTION_AGAIN since 2020 we can
> > > easily achieve that.
> > >
> > > These are the call sites that I think should be fixed with it:
> > >
> > > *** fs/proc/task_mmu.c:
> > > smaps_pte_range[634] if (pmd_trans_unstable(pmd))
> > > clear_refs_pte_range[1194] if (pmd_trans_unstable(pmd))
> > > pagemap_pmd_range[1542] if (pmd_trans_unstable(pmdp))
> > > gather_pte_stats[1891] if (pmd_trans_unstable(pmd))
> > > *** mm/memcontrol.c:
> > > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
> > > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
> > > *** mm/memory-failure.c:
> > > hwpoison_pte_range[794] if (pmd_trans_unstable(pmdp))
> > > *** mm/mempolicy.c:
> > > queue_folios_pte_range[517] if (pmd_trans_unstable(pmd))
> > > *** mm/madvise.c:
> > > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
> > > madvise_free_pte_range[625] if (pmd_trans_unstable(pmd))
> > >
> > > IIUC most of them may or may not be a big issue even without a retry,
> > > either because they're already not strict (smaps, pte_stats, MADV_COLD,
> > > .. it can mean e.g. the statistic may be inaccurate or one less 2M chunk to
> > > cold worst case), but some of them could have functional error without the
> > > retry afaiu (e.g. pagemap, where we can have the output buffer shifted over
> > > the unstable pmd range.. so IIUC the pagemap result can be wrong).
> > >
> > > While these call sites all look fine, and don't need any change:
> > >
> > > *** include/linux/pgtable.h:
> > > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> > > *** mm/gup.c:
> > > follow_pmd_mask[695] if (pmd_trans_unstable(pmd))
> > > *** mm/mapping_dirty_helpers.c:
> > > wp_clean_pmd_entry[131] if (!pmd_trans_unstable(&pmdval))
> > > *** mm/memory.c:
> > > do_anonymous_page[4060] if (unlikely(pmd_trans_unstable(vmf->pmd)))
> > > *** mm/migrate_device.c:
> > > migrate_vma_insert_page[616] if (unlikely(pmd_trans_unstable(pmdp)))
> > > *** mm/mincore.c:
> > > mincore_pte_range[116] if (pmd_trans_unstable(pmd)) {
> > >
> > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > ---
> > > fs/proc/task_mmu.c | 17 +++++++++++++----
> > > mm/madvise.c | 8 ++++++--
> > > mm/memcontrol.c | 8 ++++++--
> > > mm/memory-failure.c | 4 +++-
> > > mm/mempolicy.c | 4 +++-
> > > 5 files changed, 31 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 6259dd432eeb..823eaba5c6bf 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > > goto out;
> > > }
> > >
> > > - if (pmd_trans_unstable(pmd))
> > > + if (pmd_trans_unstable(pmd)) {
> > > + walk->action = ACTION_AGAIN;
> > > goto out;
> > > + }
> > > +
> > > /*
> > > * The mmap_lock held all the way back in m_start() is what
> > > * keeps khugepaged out of here and from collapsing things
> > > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> > > return 0;
> > > }
> > >
> > > - if (pmd_trans_unstable(pmd))
> > > + if (pmd_trans_unstable(pmd)) {
> > > + walk->action = ACTION_AGAIN;
> > > return 0;
> > > + }
> > >
> > > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > for (; addr != end; pte++, addr += PAGE_SIZE) {
> > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > > return err;
> > > }
> > >
> > > - if (pmd_trans_unstable(pmdp))
> > > + if (pmd_trans_unstable(pmdp)) {
> > > + walk->action = ACTION_AGAIN;
> > > return 0;
> >
> > Had a quick look at the pagemap code, I agree with your analysis,
> > "returning 0" may mess up pagemap, retry should be fine. But I'm
> > wondering whether we should just fill in empty entries. Anyway I don't
> > have a strong opinion on this, just a little bit concerned by
> > potential indefinite retry.
>
> Yes, none pte is still an option. But if we're going to fix this anyway,
> it seems better to fix it with the accurate new thing that poped up, and
> it's even less change (just apply walk->action rather than doing random
> stuff in different call sites).
>
> I see that you have worry on deadloop over this, so I hope to discuss
> altogether here.
>
> Unlike normal checks, pmd_trans_unstable() check means something must have
> changed in the past very short period or it should just never if nothing
> changed concurrently from under us, so it's not a "if (flag==true)" check
> which is even more likely to loop.
>
> If we see the places that I didn't touch, most of them suggested a retry in
> one form or another. So if there's a worry this will also not the first
> time to do a retry (and for such a "unstable" API, that's really the most
> natural thing to do which is to retry until it's stable).

IIUC other than do_anonymous_page() suggests retry (retry page fault),
others may not, for example:
- follow_pmd_mask: return -EBUSY
- wp_clean_pmd_entry: actually just retry for pmd_none case, but the
pagewalk code does handle pmd_none by skipping it, so it basically
just retry once
- min_core_pte_range: treated as unmapped range by calling
__mincore_unmapped_range

Anyway I really don't have a strong opinion on this. I may be just
over-concerned. I just thought if nobody cares whether the result is
accurate or not, why do we bother fixing those cases?

>
> So in general, it seems to me if we deadloop over pmd_trans_unstable() for
> whatever reason then something more wrong could have happened..
>
> Thanks,
>
> --
> Peter Xu
>