Re: [PATCH] riscv: pgtable: Enhance set_pte to prevent OoO risk

From: Guo Ren
Date: Mon Dec 11 2023 - 06:36:28 EST


On Mon, Dec 11, 2023 at 5:04 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
>
> On Mon, Dec 11, 2023 at 9:41 AM Guo Ren <guoren@xxxxxxxxxx> wrote:
> >
> > On Mon, Dec 11, 2023 at 1:52 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> > >
> > > Hi Guo,
> > >
> > > On Fri, Dec 8, 2023 at 4:10 PM <guoren@xxxxxxxxxx> wrote:
> > > >
> > > > From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> > > >
> > > > When changing a kernel page's pte from invalid to valid, no
> > > > tlb_flush is needed. That is fine under the TSO memory model, but
> > > > there is an out-of-order (OoO) risk under the weak one, e.g.:
> > > >
> > > > sd t0, (a0) // a0 = pte address, pteval is changed from invalid to valid
> > > > ...
> > > > ld t1, (a1) // a1 = va of above pte
> > > >
> > > > If the ld instruction is executed speculatively before the sd
> > > > instruction, it would bring an invalid entry into the TLB, and when
> > > > the ld instruction retires, a spurious page fault occurs. Because
> > > > vmalloc_fault ignores the vmemmap, the spurious page fault would
> > > > cause a kernel panic.
> > > >
> > > > This patch was inspired by commit 7f0b1bf04511 ("arm64: Fix barriers
> > > > used for page table modifications"). For RISC-V, the spec neither
> > > > requires all TLB entries to be valid nor requires the PTW to filter
> > > > out invalid entries. Of course, a micro-architecture could provide a
> > > > more robust design, but here we use a software fence as the guarantee.
> > > >
> > > > Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> > > > Signed-off-by: Guo Ren <guoren@xxxxxxxxxx>
> > > > ---
> > > > arch/riscv/include/asm/pgtable.h | 7 +++++++
> > > > 1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > > > index 294044429e8e..2fae5a5438e0 100644
> > > > --- a/arch/riscv/include/asm/pgtable.h
> > > > +++ b/arch/riscv/include/asm/pgtable.h
> > > > @@ -511,6 +511,13 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
> > > > static inline void set_pte(pte_t *ptep, pte_t pteval)
> > > > {
> > > >          *ptep = pteval;
> > > > +
> > > > +        /*
> > > > +         * Only if the new pte is present and kernel, otherwise TLB
> > > > +         * maintenance or update_mmu_cache() have the necessary barriers.
> > > > +         */
> > > > +        if (pte_val(pteval) & (_PAGE_PRESENT | _PAGE_GLOBAL))
> > > > +                RISCV_FENCE(rw,rw);
> > >
> > > Only a sfence.vma can guarantee that the PTW actually sees a new
> > > mapping, a fence is not enough. That being said, new kernel mappings
> > > (vmalloc ones) are correctly handled in the kernel by using
> > > flush_cache_vmap(). Did you observe something that this patch fixes?
> > Thx for the reply!
> >
> > The sfence.vma is too expensive, so the situation is tricky. See the
> > arm64 commit: 7f0b1bf04511 ("arm64: Fix barriers used for page table
> > modifications"), which is similar. That is, Linux assumes an invalid
> > pte won't get into the TLB. Think about memory hotplug:
> >
> > mm/sparse.c: sparse_add_section() {
> >         ...
> >         memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
> >         if (IS_ERR(memmap))
> >                 return PTR_ERR(memmap);
> >
> >         /*
> >          * Poison uninitialized struct pages in order to catch invalid flags
> >          * combinations.
> >          */
> >         page_init_poison(memmap, sizeof(struct page) * nr_pages);
> >         ...
> > }
> > section_activate() uses set_pte() to set up the vmemmap, and
> > page_init_poison() then accesses the struct pages that it maps.
>
> So I think the generic code must be fixed by adding a
> flush_cache_vmap() in vmemmap_populate_range() or similar: several
> architectures implement flush_cache_vmap() because they need to do
> "something" after a new mapping is established, so vmemmap should not
> be any different.
Perhaps the generic code assumes the TLB won't contain invalid entries. On
an invalid -> valid transition, Linux won't do any tlb_flush, ref:

* Use set_p*_safe(), and elide TLB flushing, when confident that *no*
* TLB flush will be required as a result of the "set". For example, use
* in scenarios where it is known ahead of time that the routine is
* setting non-present entries, or re-setting an existing entry to the
* same value. Otherwise, use the typical "set" helpers and flush the
* TLB.

>
> >
> > That means:
> > sd t0, (a0) // a0 = struct page's pte address, pteval is changed from
> >             // invalid to valid
> > ...
> > lw/sw t1, (a1) // a1 = va of struct page
> >
> > If the lw/sw instruction is executed speculatively before the sd in
> > set_pte, it can observe the old invalid pte, so we need a fence to
> > prevent this.
>
> Yes I agree, but to me we need the fence property of sfence.vma to
> make sure the PTW sees the new pte, unless I'm mistaken and something
> in the privileged specification states that a fence is enough?
All PTWs are triggered by the IFU or by loads/stores. For the "set"
scenarios, we just need to prevent accesses to the va from being executed
before the set_pte. So:
- Don't worry about the IFU, which fetches the code sequentially.
- Use a fence to prevent loads/stores from moving ahead of set_pte.

sfence.vma is used to invalidate TLB entries, not for invalid -> valid.
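
To make the "store, then fence" idea concrete, here is a minimal user-space
sketch, not kernel code. Assumptions: model_set_pte(), pte_present_kernel(),
the fence counter and the self-check are hypothetical names for illustration
only; the bit positions follow the RISC-V PTE layout (V = bit 0, G = bit 5);
and a C11 seq_cst fence merely stands in for RISCV_FENCE(rw,rw). It models
only the ordering between the pte store and later explicit accesses. Whether
that is enough for the hardware PTW is exactly what is discussed above. The
predicate here requires both bits, following the "present and kernel" wording
of the patch comment, which is stricter than testing for either bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* RISC-V PTE bits: V (valid/present) is bit 0, G (global) is bit 5. */
#define PAGE_PRESENT (UINT64_C(1) << 0)
#define PAGE_GLOBAL  (UINT64_C(1) << 5)

/* Hypothetical counter so the sketch can be observed in user space. */
static unsigned int fences_issued;

/* True only when BOTH bits are set, i.e. a present kernel mapping. */
int pte_present_kernel(uint64_t pteval)
{
    const uint64_t mask = PAGE_PRESENT | PAGE_GLOBAL;

    return (pteval & mask) == mask;
}

/* User-space model of set_pte(): store the pte, then fence if needed. */
void model_set_pte(_Atomic uint64_t *ptep, uint64_t pteval)
{
    atomic_store_explicit(ptep, pteval, memory_order_relaxed);

    if (pte_present_kernel(pteval)) {
        /* Stand-in for RISCV_FENCE(rw,rw): a full fence between the
         * store above and any later load/store in program order. */
        atomic_thread_fence(memory_order_seq_cst);
        fences_issued++;
    }
}

unsigned int model_fences_issued(void)
{
    return fences_issued;
}

/* Self-check: only a present + global value takes the fence path. */
void model_self_check(void)
{
    _Atomic uint64_t pte = 0;
    unsigned int before = model_fences_issued();

    model_set_pte(&pte, PAGE_PRESENT);               /* present, not global */
    assert(model_fences_issued() == before);

    model_set_pte(&pte, PAGE_PRESENT | PAGE_GLOBAL); /* present kernel pte */
    assert(model_fences_issued() == before + 1);
}
```

Calling model_self_check() exercises both paths: a value with only the
present bit set skips the fence, while present + global takes it.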

>
> >
> > >
> > > Thanks,
> > >
> > > Alex
> > >
> > > > }
> > > >
> > > > void flush_icache_pte(pte_t pte);
> > > > --
> > > > 2.40.1
> > > >
> >
> >
> >
> > --
> > Best Regards
> > Guo Ren



--
Best Regards
Guo Ren