Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

From: Claudio Imbrenda
Date: Mon Jul 03 2023 - 07:01:44 EST


On Fri, 30 Jun 2023 12:22:43 -0700 (PDT)
Hugh Dickins <hughd@xxxxxxxxxx> wrote:

[...]

> That's something I would have expected to be handled already via
> mmu_notifiers, rather than buried inside the page table freeing.
>
> If s390 is the only architecture to go that way, and could instead do
> it via mmu_notifiers, then I think that will be more easily supported
> in the long term.

I am very well aware of that, and in fact I've been working on
exactly that for some time already. But it's a very complex minefield
and therefore I'm proceeding *very* carefully. It will take quite some
time before anything comes out.

>
> But I'm writing from a position of very great ignorance: advising
> KVM on s390 is many dimensions away from what I'm capable of.

fair enough, but in this case it doesn't mean that you are not right :)

>
> >
> > the point here is: we need that only for page_table_free_rcu(); all
> > other users of page_table_free() cannot act on guest page tables
>
> I might be wrong, but I think that most users of page_table_free()
> are merely freeing a page table which had to be allocated up front,
> but was then found unnecessary (maybe a racing task already inserted
> one): page tables which were never exposed to actual use.

that was my impression as well

> > (because we don't allow THP for KVM guests). and that is why
> > page_table_free() does not do gmap_unlink() currently.
>
> But THP collapse does (or did before this series) use it to free a
> page table which had been exposed to use. The fact that s390 does

that is also my understanding

> not allow THP for KVM guests makes page_table_free(), and this new
> pte_free_defer(), safe for that; but it feels dangerously coincidental.

not really; my guess is that we _intentionally_ did not do anything
there, because we knew we did not need it, knowing well that
we would need it once we would want to support THP for guests.
so not a coincidence, but a conscious decision based, I guess, on
touching as little code as needed.

>
> It's easy to imagine a future change being made, which would stumble
> over this issue. I have imagined that pte_free_defer() will be useful
> in future, in the freeing of empty page tables: but s390 may pose a
> problem there - though perhaps no more of a problem than additionally
> needing to pass a virtual address down the stack.

yeah it can always be fixed later if we need to

>
> >
> > >
> > > >
> > > > or will it be used instead of page_table_free?
> > >
> > > Not always; but yes, this case of removing a page table used
> > > page_table_free() before; but now, with the lighter locking, needs
> > > to keep the page table valid until the RCU grace period expires.
> >
> > so if I understand correctly your code will, sometimes, under some
> > circumstances, replace what page_table_free() does, but it will never
> > replace page_table_free_rcu()?
> >
> > because in that case there would be no issues
>
> Yes, thanks for confirming: we have no issue here at present, but may
> do if use of pte_free_defer() is extended to other contexts in future.
>
> Would it be appropriate to add a WARN_ON_ONCE around that
> > > > > + if (mm_alloc_pgste(mm)) {
> in pte_free_defer()?

that's actually not a bad idea. should never happen, but... that's the
whole point of a WARN_ON after all

>
> I ask that somewhat rhetorically: that block disappears in the later
> version I was working on last night (and will return to shortly), in
> which pte_free_defer() just sets a bit and calls page_table_free().
>
> But I'd like to understand the possibilities better: does mm_alloc_pgste()
> correspond 1:1 to KVM guest on s390, or does it cover several different
> possibilities of which KVM guest is one, or am I just confused to be
> thinking there's any relationship?

this is... historically complicated (because of course it is)

in theory any process can allocate PGSTEs by having the right bits in
the ELF header (that's how QEMU does it currently). And QEMU will have
PGSTEs allocated even if it does not actually start any guests.

Then we have the vm.allocate_pgste sysctl knob; once enabled, it will
cause all processes to have PGSTEs allocated. This is how we handled
PGSTEs before we switched to ELF header bits.

So in summary: in __practice__ yes, only QEMU will have PGSTEs. But in
theory anything is possible and allowed.

>
> Thanks,
> Hugh
>
> >
> > >
> > > >
> > > > this is actually quite important for KVM on s390
> > >
> > > None of us are wanting to break KVM on s390: your guidance appreciated!
> > >
> > > Thanks,
> > > Hugh