Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

From: Christian König
Date: Mon Oct 04 2021 - 09:24:22 EST


Am 04.10.21 um 15:11 schrieb Jason Gunthorpe:
On Mon, Oct 04, 2021 at 08:58:35AM +0200, Christian König wrote:
I'm not following this discussion to closely, but try to look into it from
time to time.

Am 01.10.21 um 19:45 schrieb Jason Gunthorpe:
On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:

In device-dax, the refcount is only used to prevent the device, and
therefore the pages, from going away on device unbind. Pages cannot be
recycled, as you say, as they are mapped linearly within the device. The
address space invalidation is done only when the device is unbound.
By address space invalidation I mean invalidation of the VMA that is
pointing to those pages.

device-dax may not have a issue with use-after-VMA-invalidation by
it's very nature since every PFN always points to the same
thing. fsdax and this p2p stuff are different though.

Before the invalidation, an active flag is cleared to ensure no new
mappings can be created while the unmap is proceeding.
unmap_mapping_range() should sequence itself with the TLB flush and
AFIAK unmap_mapping_range() kicks off the TLB flush and then
returns. It doesn't always wait for the flush to fully finish. Ie some
cases use RCU to lock the page table against GUP fast and so the
put_page() doesn't happen until the call_rcu completes - after a grace
period. The unmap_mapping_range() does not wait for grace periods.
Wow, wait a second. That is quite a boomer. At least in all GEM/TTM based
graphics drivers that could potentially cause a lot of trouble.

I've just double checked and we certainly have the assumption that when
unmap_mapping_range() returns the pte is gone and the TLB flush completed in
quite a number of places.

Do you have more information when and why that can happen?
There are two things to keep in mind, flushing the PTEs from the HW
and serializing against gup_fast.

If you start at unmap_mapping_range() the page is eventually
discovered in zap_pte_range() and the PTE cleared. It is then passed
into __tlb_remove_page() which puts it on the batch->pages list

The page free happens in tlb_batch_pages_flush() via
free_pages_and_swap_cache()

The tlb_batch_pages_flush() happens via zap_page_range() ->
tlb_finish_mmu(), presumably after the HW has wiped the TLB's on all
CPUs. On x86 this is done with an IPI and also serializes gup fast, so
OK

The interesting case is CONFIG_MMU_GATHER_RCU_TABLE_FREE which doesn't
rely on IPIs anymore to synchronize with gup-fast.

In this configuration it means when unmap_mapping_range() returns the
TLB will have been flushed, but no serialization with GUP fast was
done.

This is OK if the GUP fast cannot return the page at all. I assume
this generally describes the DRM caes?

Yes, exactly that. GUP is completely forbidden for such mappings.

But what about accesses by other CPUs? In other words our use case is like the following:

1. We found that we need exclusive access to the higher level object a page belongs to.

2. The lock of the higher level object is taken. The lock is also taken in the fault handler for the VMA which inserts the PTE in the first place.

3. unmap_mapping_range() for the range of the object is called, the expectation is that when that function returns only the kernel can have a mapping of the pages backing the object.

4. The kernel has exclusive access to the pages and we know that userspace can't mess with them any more.

That use case is completely unrelated to GUP and when this doesn't work we have quite a problem.

I should probably note that we recently switched from VM_MIXEDMAP to using VM_PFNMAP because the former didn't prevented GUP on all architectures.

Christian.

However, if the GUP fast can return the page then something,
somewhere, needs to serialize the page free with the RCU as the GUP
fast can be observing the old PTE before it was zap'd until the RCU
grace expires.

Relying on the page ref being !0 to protect GUP fast is not safe
because the page ref can be incr'd immediately upon page re-use.

Interestingly I looked around for this on PPC and I only found RCU
delayed freeing of the page table level, not RCU delayed freeing of
pages themselves.. I wonder if it was missed?

There is a path on PPC (tlb_remove_table_sync_one) that triggers an
IPI but it looks like an exception, and we wouldn't need the RCU at
all if we used IPI to serialize GUP fast...

It makes logical sense if the RCU also frees the pages on
CONFIG_MMU_GATHER_RCU_TABLE_FREE so anything returnable by GUP fast
must be refcounted and freed by tlb_batch_pages_flush(), not by the
caller of unmap_mapping_range().

If we expect to allow the caller of unmap_mapping_range() to free then
CONFIG_MMU_GATHER_RCU_TABLE_FREE can't really exist, we always need to
trigger a serializing IPI during tlb_batch_pages_flush()

AFAICT, at least

Jason