Re: [RFC PATCH v2 3/3] mm: mlock: update mlock_pte_range to handle large folio

From: Yosry Ahmed
Date: Wed Jul 19 2023 - 11:44:59 EST


On Wed, Jul 19, 2023 at 7:26 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
>
> On Wed, 19 Jul 2023, Yin Fengwei wrote:
> > >>>>>>>>> Could this also happen with a normal 4K page? I mean, when a user
> > >>>>>>>>> tries to munlock a normal 4K page while that page is isolated,
> > >>>>>>>>> does the page then become unevictable?
> > >>>>>>>> Looks like it is possible. If cpu 1 is in __munlock_folio() and
> > >>>>>>>> cpu 2 is isolating the folio for any purpose:
> > >>>>>>>>
> > >>>>>>>> cpu1                           cpu2
> > >>>>>>>>                                isolate folio
> > >>>>>>>> folio_test_clear_lru() // 0
> > >>>>>>>>                                putback folio // add to unevictable list
> > >>>>>>>> folio_test_clear_mlocked()
> > >>>>>>>>                                folio_set_lru()
> > Let's wait for the response from Hugh and Yu. :).
>
> I haven't been able to give it enough thought, but I suspect you are right:
> that the current __munlock_folio() is deficient when folio_test_clear_lru()
> fails.
>
> (Though it has not been reported as a problem in practice: perhaps because
> so few places try to isolate from the unevictable "list".)
>
> I forget what my order of development was, but it's likely that I first
> wrote the version for our own internal kernel - which used our original
> lruvec locking, which did not depend on getting PG_lru first (having got
> lru_lock, it checked memcg, then tried again if that had changed).

Right. Just holding the lruvec lock without clearing PG_lru would not
protect against memcg movement in this case.

>
> I was uneasy with the PG_lru aspect of upstream lru_lock implementation,
> but it turned out to work okay - elsewhere; but it looks as if I missed
> its implication when adapting __munlock_page() for upstream.
>
> If I were trying to fix this __munlock_folio() race myself (sorry, I'm
> not), I would first look at that aspect: instead of folio_test_clear_lru()
> behaving always like a trylock, could "folio_wait_clear_lru()" or whatever
> spin waiting for PG_lru here?
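
To make the suggestion concrete: a rough userspace sketch of what such a
"folio_wait_clear_lru()" could look like. This is not kernel code; PG_lru
is modeled as a single C11 atomic flag, and folio_wait_clear_lru() is the
hypothetical name from the mail above, not an existing kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool pg_lru = true;	/* model: folio starts on an LRU list */

/* Trylock flavor, as folio_test_clear_lru() behaves today: returns
 * false if someone else has already isolated the folio. */
bool folio_test_clear_lru(void)
{
	return atomic_exchange(&pg_lru, false);
}

/* Hypothetical spin flavor: wait for the isolator to put PG_lru back,
 * then clear it ourselves, so the caller never skips its LRU fixup. */
void folio_wait_clear_lru(void)
{
	while (!folio_test_clear_lru())
		;	/* cpu_relax() in the kernel */
}
```

The trade-off, of course, is that the munlock path would then wait on
whoever holds the folio isolated, instead of bailing out as it does now.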

+Matthew Wilcox

It seems to me that before 70dea5346ea3 ("mm/swap: convert lru_add to
a folio_batch"), __pagevec_lru_add_fn() (aka lru_add_fn()) used to do
folio_set_lru() before checking folio_evictable(). That earlier
folio_set_lru() was probably redundant, since folio_batch_move_lru()
sets the flag again afterwards, but it also looks harmless: the lruvec
lock is held throughout (so no one can complete a folio isolation
anyway), and as far as I can tell that extra folio_set_lru()
introduced no problems.

If we restore folio_set_lru() to lru_add_fn(), and revert 2262ace60713
("mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()") to restore
the strict ordering between manipulating PG_lru and PG_mlocked, I
suppose we can get away without having to spin. Again, that would only
be possible if reworking mlock_count [1] is acceptable. Otherwise, we
can't clear PG_mlocked before PG_lru in __munlock_folio().
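
That ordering is essentially the classic store-buffering pattern. A rough
userspace model (all names hypothetical; C11 atomics and
atomic_thread_fence() standing in for the kernel's bitops and the smp_mb()
that 2262ace60713 removed) of why the barriers matter: with both fences in
place, it cannot happen that the munlock side misses PG_lru *and* the
lru_add side still sees PG_mlocked set, which is exactly the
stranded-unevictable outcome.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool pg_lru;			/* model: folio not yet on an LRU list */
atomic_bool pg_mlocked = true;		/* model: folio currently mlocked */

/* __munlock_folio() side: clear PG_mlocked, then observe PG_lru. */
bool munlock_side_sees_lru(void)
{
	atomic_store_explicit(&pg_mlocked, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_load_explicit(&pg_lru, memory_order_relaxed);
}

/* lru_add_fn() side: set PG_lru first, then check PG_mlocked when
 * deciding whether the folio is evictable. */
bool lru_add_side_sees_mlocked(void)
{
	atomic_store_explicit(&pg_lru, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_load_explicit(&pg_mlocked, memory_order_relaxed);
}
```

However the two sides interleave, at least one of them observes the
other's store, so one side always does the unevictable-list fixup.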

I am not saying this is necessarily better than spinning, just a note
(and perhaps selfishly making [1] more appealing ;)).

[1] https://lore.kernel.org/lkml/20230618065719.1363271-1-yosryahmed@xxxxxxxxxx/

>
> Hugh