Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

From: Dan Williams
Date: Sun Apr 16 2017 - 11:53:56 EST


On Sat, Apr 15, 2017 at 8:01 PM, Benjamin Herrenschmidt
<benh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
>> I'm wondering, since this is limited to support behind a single
>> switch, if you could have a software-iommu hanging off that switch
>> device object that knows how to catch and translate the non-zero
>> offset bus address case. We have something like this with the VMD driver,
>> and I toyed with a soft pci bridge when trying to support AHCI+NVME
>> bar remapping. When the dma api looks up the iommu for its device it
>> hits this soft-iommu and that driver checks if the page is host memory
>> or device memory to do the dma translation. You wouldn't need a bit in
>> struct page, just a lookup to the hosting struct dev_pagemap in the
>> is_zone_device_page() case and that can point you to p2p details.
>
> I was thinking about a hook in the arch DMA ops but that kind of
> wrapper might work instead indeed. However I'm not sure what's the best
> way to "instantiate" it.
>
> The main issue is that the DMA ops are a function of the initiator,
> not the target (since the target is supposed to be memory) so things
> are a bit awkward.
>
> One (user ?) would have to know that a given device "intends" to DMA
> directly to another device.
>
> This is awkward because in the ideal scenario, this isn't something the
> device knows. For example, one could want to have an existing NIC DMA
> directly to/from NVME pages or GPU pages.
>
> The NIC itself doesn't know the characteristic of these pages, but
> *something* needs to insert itself in the DMA ops of that bridge to
> make it possible.
>
> That's why I wonder if it's the struct page of the target that should
> be "marked" in such a way that the arch DMA ops can immediately catch
> that they belong to a device and might require "wrapped" operations.
>
> Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
> flag ?)

Yes, is_zone_device_page(). However I think we're getting to the point
with pmem, hmm, cdm, and now p2p where ZONE_DEVICE is losing specific
meaning and we need to have explicit type checks like is_hmm_page() and
is_p2p_page() that internally check is_zone_device_page() plus some
other specific type.
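
To illustrate the kind of layered check I mean, here is a rough
userspace sketch. Everything prefixed mock_ is made up for this mail
(the type enum, the fields, the helpers) and does not mirror the real
struct page or struct dev_pagemap layout; it only shows a p2p test
built on top of the generic zone-device test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock types; NOT the kernel's definitions. */
enum mock_pgmap_type { MOCK_PGMAP_PMEM, MOCK_PGMAP_HMM, MOCK_PGMAP_P2P };

struct mock_type_pagemap {
	enum mock_pgmap_type type;	/* assumed per-mapping type tag */
};

struct mock_page {
	bool zone_device;			/* stands in for the zone check */
	struct mock_type_pagemap *pgmap;	/* meaningful only if zone_device */
};

/* Generic test: is this page backed by device memory at all? */
static bool mock_is_zone_device_page(const struct mock_page *page)
{
	assert(page != NULL);
	return page->zone_device;
}

/* Specific test: generic zone-device check plus a type comparison. */
static bool mock_is_type_page(const struct mock_page *page,
			      enum mock_pgmap_type type)
{
	return mock_is_zone_device_page(page) &&
	       page->pgmap != NULL &&
	       page->pgmap->type == type;
}

static bool mock_is_hmm_page(const struct mock_page *page)
{
	return mock_is_type_page(page, MOCK_PGMAP_HMM);
}

static bool mock_is_p2p_page(const struct mock_page *page)
{
	return mock_is_type_page(page, MOCK_PGMAP_P2P);
}
```

Host memory fails the first test and never touches pgmap, so the
normal-memory path stays a single flag check in this sketch.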

> That would allow us to keep a fast path for normal memory targets, but
> also have some kind of way to handle the special cases of such peer 2
> peer (or also handle other type of peer to peer that don't necessarily
> involve PCI address wrangling but could require additional iommu bits).
>
> Just thinking out loud ... I don't have a firm idea or a design. But
> peer to peer is definitely a problem we need to tackle generically, the
> demand for it keeps coming up.

ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
context about the physical address in question. I'm thinking you can
hang bus address translation data off of that structure. This seems
vaguely similar to what HMM is doing.
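
Roughly what I have in mind, as a userspace sketch rather than kernel
code. The mock_ names, the bus_start field, and the table lookup are
all assumptions for illustration; the real get_dev_pagemap() has a
different signature and the real struct dev_pagemap carries no such
field today. Host memory is treated as identity-mapped here purely to
keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-mapping context with bus translation data attached. */
struct mock_p2p_pagemap {
	uint64_t phys_start;	/* first CPU physical address covered */
	uint64_t size;		/* length of the range in bytes */
	uint64_t bus_start;	/* assumed: bus address of phys_start */
};

/* Stand-in for the pagemap lookup: find the range containing phys. */
static const struct mock_p2p_pagemap *
mock_get_dev_pagemap(const struct mock_p2p_pagemap *maps, size_t nr,
		     uint64_t phys)
{
	size_t i;

	assert(maps != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		if (phys >= maps[i].phys_start &&
		    phys - maps[i].phys_start < maps[i].size)
			return &maps[i];
	}
	return NULL;	/* not device memory */
}

/* Translate a CPU physical address to the address the initiator uses. */
static uint64_t mock_phys_to_bus(const struct mock_p2p_pagemap *maps,
				 size_t nr, uint64_t phys)
{
	const struct mock_p2p_pagemap *pgmap;

	pgmap = mock_get_dev_pagemap(maps, nr, phys);
	if (!pgmap)
		return phys;	/* host memory: identity in this mock */

	/* Device memory: apply the non-zero offset for this mapping. */
	return pgmap->bus_start + (phys - pgmap->phys_start);
}
```

The point is only that the translation data lives with the target's
mapping, so the initiator's driver does not need to know anything
about the pages it is handed.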