Re: mm: delay rmap removal until after TLB flush

From: Johannes Weiner
Date: Mon Nov 07 2022 - 15:07:22 EST


Hello,

On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote:
> lock_page_memcg (uncertain)
> ---------------------------
> Johannes, please help! Linus has got quite upset by lock_page_memcg(),
> its calls from mm/rmap.c anyway, and most particularly by the way
> in which it is called at the start of page_remove_rmap(), before
> anyone's critical atomic_add_negative(), yet its use is to guarantee
> the stability of page memcg while doing the stats updates, done only
> when atomic_add_negative() says so.

As you mentioned, the pte lock historically wasn't always taken on the
move side. And so the reason lock_page_memcg() covers the mapcount
update is that move_account() needs to atomically see either a) the
page is mapped and counted, or b) unmapped and uncounted. If we lock
after the mapcount decrement, move_account() could miss pending
updates that need to be transferred, and the scheme would break
thusly:

memcg1->nr_mapped = 1
memcg2->nr_mapped = 0

page_remove_rmap:                   mem_cgroup_move_account():
  if atomic_add_negative(page->mapcount):
                                      lock_page_memcg()
                                      if page->mapcount: // NOT TAKEN
                                        memcg1->nr_mapped--
                                        memcg2->nr_mapped++
                                      page->memcg = memcg2
                                      unlock_page_memcg()
  lock_page_memcg()
  page->memcg->nr_mapped--          // UNDERFLOW memcg2->nr_mapped
  unlock_page_memcg()
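
For reference, the ordering we have today in page_remove_rmap() is
roughly this (a heavily simplified sketch - the anon/file and
compound cases are elided):

	void page_remove_rmap(struct page *page)
	{
		lock_page_memcg(page);

		/* Was this the last mapping? */
		if (!atomic_add_negative(-1, &page->_mapcount))
			goto out;

		/*
		 * page's memcg is stable here: a racing
		 * move_account() either already saw the page mapped
		 * and moved this stat, or sees it unmapped and skips
		 * it - either way the update lands on the right
		 * memcg.
		 */
		__dec_lruvec_page_state(page, NR_FILE_MAPPED);
	out:
		unlock_page_memcg(page);
	}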

> I do have one relevant insight on this. It (or its antecedents under
> other names) dates from the days when we did "reparenting" of memcg
> charges from an LRU: and in those days the lock_page_memcg() before
> mapcount adjustment was vital, to pair with the uses of folio_mapped()
> or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
> are precisely around the stats which the rmap functions affect.
>
> But nowadays mem_cgroup_move_account() is only called, with page table
> lock held, on matching pages found in a task's page table: so its
> "mapped" checks are redundant - I've sometimes thought in the past of
> removing them, but held back, because I always have the notion (not
> hope!) that "reparenting" may one day be brought back from the grave.
> I'm too out of touch with memcg to know where that notion stands today.
>
> I've gone through a multiverse of opinions on those lock_page_memcg()s
> in the last day: I currently believe that Linus is right, that the
> lock_page_memcg()s could and should be moved just before the stats
> updates. But I am not 100% certain of that - is there still some
> reason why it's important that the page memcg at the instant of the
> critical mapcount transition be kept unchanged until the stats are
> updated? I've tried running scenarios through my mind but given up.

Okay, I think there are two options.

- If we don't want to codify the pte lock requirement on the move
  side, then moving the lock_page_memcg() like that would break the
  locking scheme, as per above.

- If we DO want to codify the pte lock requirement, we should just
  remove the lock_page_memcg() altogether, as it's fully redundant
  (sketched below).
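
Concretely, for the second option (a hypothetical sketch only, not
the patch below): with the pte lock required on the move side, rmap
is already serialized against move_account() by the page table lock
its caller holds, and the memcg lock can go away entirely:

	void page_remove_rmap(struct page *page)
	{
		/*
		 * Caller holds the pte lock, and move_account() only
		 * touches pages it finds in a page table under that
		 * same lock, so page's memcg can't change under us -
		 * no lock_page_memcg() needed.
		 */
		if (!atomic_add_negative(-1, &page->_mapcount))
			return;

		__dec_lruvec_page_state(page, NR_FILE_MAPPED);
	}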

I'm leaning toward the second option. If somebody brings back
reparenting they can bring the lock back along with it.

[ If it's even still necessary by then.

It's conceivable reparenting is brought back only for cgroup2, where
the race wouldn't matter because of the hierarchical stats. The
reparenting side wouldn't have to move page state to the parent -
it's already there. And whether rmap would see the dying child or
the parent doesn't matter much either: the parent will see the
update anyway, directly or recursively, and we likely don't care to
balance the books on a dying cgroup.

It's then just a matter of lifetime - which should be guaranteed
also, as long as the pte lock prevents an RCU quiescent state. ]
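
To illustrate that aside (my assumption about how cgroup2's
hierarchical stats would behave): a parent's recursive count is its
own counts plus the sum of its children's, i.e.

	parent->nr_mapped (recursive) = parent own + sum(child own)

so whether the rmap update lands on the dying child or on the parent
it was reparented to, the parent's recursive total drops by one
either way; only the dying child's own counter could be left
unbalanced, and as said, we likely don't care about that.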

So how about something like below?

UNTESTED, just for illustration. This is cgroup1 code, which I haven't
looked at too closely in a while. If you can't spot an immediate hole
in it, I'd go ahead and test it and send a proper patch.

---