Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

From: Jason Gunthorpe
Date: Wed Mar 06 2024 - 10:05:48 EST


On Wed, Mar 06, 2024 at 03:33:21PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 05, 2024 at 08:51:56AM -0700, Keith Busch wrote:
> > On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> > > @@ -236,7 +236,9 @@ struct nvme_iod {
> > >  	unsigned int dma_len;	/* length of single DMA segment mapping */
> > >  	dma_addr_t first_dma;
> > >  	dma_addr_t meta_dma;
> > > -	struct sg_table sgt;
> > > +	struct dma_iova_attrs iova;
> > > +	dma_addr_t dma_link_address[128];
> > > +	u16 nr_dma_link_address;
> > >  	union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> > >  };
> >
> > That's quite a lot of space to add to the iod. We preallocate one for
> > every request, and there could be millions of them.
>
> Yes. And this whole proposal also seems clearly confused (not just
> because of the gazillion reposts) but because it mixes up the case
> where we can coalesce CPU regions into a single dma_addr_t range
> (iommu and maybe in the future swiotlb) and one where we need a

I had the broad expectation that the DMA API user would already be
providing a place to store the dma_addr_t as it has to feed that into
the HW. That memory should simply last up until we do dma unmap and
the cases that need dma_addr_t during unmap can go get it from there.
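
To make that concrete, here is a rough userspace sketch of the
arrangement. Everything in it is made up for illustration (hw_sgl_desc,
hw_cmd, mock_map() and mock_unmap() are hypothetical stand-ins, not
kernel or NVMe API); the only point is that the descriptor the HW
consumes doubles as the storage the unmap path reads from, so no side
array of dma_addr_t is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* stand-in for the kernel type */

/* Hypothetical HW descriptor: the device needs addr/len anyway. */
struct hw_sgl_desc {
	dma_addr_t addr;
	uint32_t len;
};

#define MAX_DESC 8

/* Hypothetical per-command state; note there is no extra address array. */
struct hw_cmd {
	struct hw_sgl_desc desc[MAX_DESC];
	unsigned int nr_desc;
};

/* Mock bookkeeping so the sketch can check that map/unmap balance. */
static unsigned int mock_live_mappings;

/* Mock "map": returns a fake bus address and counts the mapping. */
static dma_addr_t mock_map(uint64_t cpu_addr, uint32_t len)
{
	(void)len;
	mock_live_mappings++;
	return cpu_addr + 0x100000000ULL;	/* arbitrary fake offset */
}

/* Mock "unmap": only drops the mapping count. */
static void mock_unmap(dma_addr_t addr, uint32_t len)
{
	(void)addr;
	(void)len;
	mock_live_mappings--;
}

/* Map side: write each dma_addr_t straight into the HW descriptor. */
static int cmd_map(struct hw_cmd *cmd, const uint64_t *cpu,
		   const uint32_t *len, unsigned int n)
{
	unsigned int i;

	if (n > MAX_DESC)
		return -1;

	for (i = 0; i < n; i++) {
		cmd->desc[i].addr = mock_map(cpu[i], len[i]);
		cmd->desc[i].len = len[i];
	}
	cmd->nr_desc = n;
	return 0;
}

/* Unmap side: recover every address from the same descriptors. */
static void cmd_unmap(struct hw_cmd *cmd)
{
	unsigned int i;

	for (i = 0; i < cmd->nr_desc; i++)
		mock_unmap(cmd->desc[i].addr, cmd->desc[i].len);
	cmd->nr_desc = 0;
}
```

The descriptors have to stay intact until cmd_unmap() runs, which is
the "last up until we do dma unmap" condition above.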

If that is how things are organized, is there another reason to lean
further into single-range case optimization?

We can't do much on the map side, as a single range doesn't imply a
contiguous range: P2P and alignment create discontinuities in the
dma_addr_t that still have to be dealt with.

Jason