Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

From: Gerald Schaefer
Date: Wed Jul 05 2023 - 08:56:53 EST


On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
Hugh Dickins <hughd@xxxxxxxxxx> wrote:

> On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > >
> > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > > extent that I've not even tried to verify it; but I think I do get the
> > > > point now, that we need further info than just PPHHAA to know whether
> > > > the page is on the list or not. But I think that if we move where the
> > > > call_rcu() is done, then the page can stay on or off the list by same
> > > > rules as before (but need to check HH bits along with PP when deciding
> > > > whether to allocate, and whether to list_add_tail() when freeing).
> > >
> > > No, not quite the same rules as before: I came to realize that using
> > > list_add_tail() for the HH pages would be liable to put a page on the
> > > list which forever blocked reuse of PP list_add_tail() pages after it
> > > (could be solved by a list_move() somewhere, but we have agreed to
> > > prefer simplicity).
> > >
> > > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > > an easier way of dealing with those "is it on the list" questions.
> > > I expect that we shall be close to reaching agreement on...
> >
> > This looks really nice, almost too good and easy to be true. I did not
> > find any obvious flaw, just some comments below. It also survived LTP
> > without any visible havoc, so I guess this approach is the best so far.
>
> Phew! I'm of course glad to hear this: thanks for your efforts on it.
>
> ...
> > > --- a/arch/s390/mm/pgalloc.c
> > > +++ b/arch/s390/mm/pgalloc.c
> > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > > * while the PP bits are never used, nor such a page is added to or removed
> > > * from mm_context_t::pgtable_list.
> > > + *
> > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> >
> > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > allow reuse before grace period elapsed. And I hope that it does so, by
> > setting the PP bits, which would be noticed in page_table_alloc(), in
> > case the page would be seen there.
> >
> > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > end of the list, and so they could be seen in page_table_alloc(), but they
> > should not be reused before grace period elapsed and __tlb_remove_table()
> > cleared the PP bits, as far as I understand.
> >
> > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > guarantee"?
>
> I'll answer without locating and re-reading what Jason explained earlier,
> perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> he may have explained it better. And without working out again all the
> MMU_GATHER #defines, and which of them do and do not apply to s390 here.
>
> The detail that sticks in my mind is the fallback in tlb_remove_table()

Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
but that is rather a generic issue, and not s390-specific. I thought you
meant some s390-oddity here, of which we have a lot, unfortunately...
Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
I guess you could say that page_table_free_rcu() cannot guarantee what
tlb_remove_table() cannot guarantee.

Maybe change to "which page_table_free_rcu() does not actually guarantee,
by calling tlb_remove_table()", to make it clear that this is not a problem
of page_table_free_rcu() itself.

> in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> batch the tables for freeing by RCU, and resorts instead to an immediate
> TLB flush (I think: that again involves chasing definitions) followed by
> tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> and is commented:
> /*
> * This isn't an RCU grace period and hence the page-tables cannot be
> * assumed to be actually RCU-freed.
> *
> * It is however sufficient for software page-table walkers that rely on
> * IRQ disabling.
> */
>
> Whether that's good for your PP pages or not, I've given no thought:
> I've just taken it on trust that what s390 has working today is good.

Yes, we should be fine with that, current code can be trusted :-)

>
> If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> then I would not have written "(which page_table_free_rcu() does not
> actually guarantee)". But it cannot use call_rcu() because it does
> not have an rcu_head to work with - it's in some generic code, and
> there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
>
> And Jason would have much preferred us to address the issue from that
> angle; but not only would doing so destroy my sanity, I'd also destroy
> 20 architectures TLB-flushing, unbuilt and untested, in the attempt.

Oh yes, if your changes would have allowed to get rid of that "semi RCU"
logic, that would really be a major boost in popularity, I guess. But
it probably is as it is, because it is not so easily fixed...

>
> ...
> > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > */
> > > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > > mask >>= 24;
> > > - if (mask & 0x03U)
> > > + if ((mask & 0x03U) && !PageActive(page)) {
> > > + /*
> > > + * Other half is allocated, and neither half has had
> > > + * its free deferred: add page to head of list, to make
> > > + * this freed half available for immediate reuse.
> > > + */
> > > list_add(&page->lru, &mm->context.pgtable_list);
> > > - else
> > > - list_del(&page->lru);
> > > + } else {
> > > + /* If page is on list, now remove it. */
> > > + list_del_init(&page->lru);
> > > + }
> >
> > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > other half is still allocated, when called from pte_free_defer() on a
> > fully allocated page, which was not on the list (and with PageActive, and
> > (mask & 0x03U) true).
> > Not sure if adding an additional mask check to the else path would be
> > needed, but it seems that list_del_init() should also be able to handle
> > this.
>
> list_del_init() is very cheap in the unnecessary case: the cachelines
> required are already there. You don't want a flag to say whether to
> call it or not, it is already the efficient approach.

Yes, I also see no functional issue here. Just thought that the extra
write could be avoided, e.g. by checking for list_empty() or mask first.
But I guess that is simply the benefit of list_del_init(), that you
don't have to check, at least if it is guaranteed that rcu_head is
never in use here.

Then maybe adjust the comment, because now it makes you wonder, when
you read (and understand) the code, you see that this list_del_init()
might also be called for pages not on the list.

>
> (But you were right not to use it in your pt_frag_refcount version,
> because there we were still trying to do the call_rcu() per fragment
> rather than per page, so page->lru could have been on the RCU queue.)

That is actually the one thing I still try to figure out, by drawing
pictures, i.e. if we really really never end up here on list_del_init(),
while using rcu_head, e.g. by racing PageActive.

>
> >
> > Same thought applies to the similar logic in page_table_free_rcu()
> > below.
> >
> > > spin_unlock_bh(&mm->context.lock);
> > > mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > > mask >>= 24;
> > > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > > }
> > >
> > > page_table_release_check(page, table, half, mask);
> > > - pgtable_pte_page_dtor(page);
> > > - __free_page(page);
> > > + if (TestClearPageActive(page))
> > > + call_rcu(&page->rcu_head, pte_free_now);
> > > + else
> > > + pte_free_now(&page->rcu_head);
> >
> > This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> > worries me a bit, because it is done outside the spin_lock. It "feels" like
> > there could be some race with the PageActive checks inside the spin_lock,
> > but when drawing some pictures, I could not find any such scenario yet.
> > Also, our existing spin_lock is probably not supposed to protect against
> > PageActive changes anyway, right?
>
> Here (and similarly in __tlb_remove_table()) is where we are about to free
> the page table page: both of the fragments have already been released,
> there is nobody left who could be racing against us to set PageActive.

Yes, that is what makes this approach so nice, i.e. no more checking
for HH bits or worry about double call_rcu(), simply do the the freeing
whenever the page is ready. At least in theory, still drawing pictures :-)

But this really looks very good to me, and also works with LTP not worse
than the other approaches.