Re: [PATCH rdma-next v1 1/1] RDMA/mana_ib: Fix bug in creation of dma regions

From: Konstantin Taranov
Date: Thu Feb 08 2024 - 17:04:48 EST


> From: Jason Gunthorpe <jgg@xxxxxxxx>
> On Thu, Feb 08, 2024 at 06:53:05PM +0000, Konstantin Taranov wrote:
> > > From: Long Li <longli@xxxxxxxxxxxxx>
> > > Sent: Thursday, 8 February 2024 19:43
> > > To: Konstantin Taranov <kotaranov@xxxxxxxxxxxxxxxxxxx>; Konstantin
> > > Taranov <kotaranov@xxxxxxxxxxxxx>; sharmaajay@xxxxxxxxxxxxx;
> > > jgg@xxxxxxxx; leon@xxxxxxxxxx
> > > Cc: linux-rdma@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > > Subject: RE: [PATCH rdma-next v1 1/1] RDMA/mana_ib: Fix bug in
> > > creation of dma regions
> > >
> > > >
> > > > /* Hardware requires dma region to align to chosen page size */
> > > > - page_sz = ib_umem_find_best_pgsz(umem, PAGE_SZ_BM, 0);
> > > > + page_sz = ib_umem_find_best_pgsz(umem, PAGE_SZ_BM, virt);
> > > > if (!page_sz) {
> > > > ibdev_dbg(&dev->ib_dev, "failed to find page size.\n");
> > > > return -ENOMEM;
> > > > }
> > >
> > > How about doing:
> > > page_sz = ib_umem_find_best_pgsz(umem, PAGE_SZ_BM,
> force_zero_offset
> > > ? 0 : virt);
> > >
> > > Will this work? This can get rid of the following while loop.
> > >
> >
> > I do not think so. I mentioned once, that it was failing for me with
> > existing code with the 4K-aligned addresses and 8K pages. In this
> > case, we miscalculate the number of pages. So, we think that it is one 8K
> page, but it is in fact two.
>
> That is a confusing statement.. What is "we" here?

Sorry, I meant here the helper thinks that it is one 8K page for 8K buffer.
But the offset will not not zero when the buffer is only 4K aligned. So the actual
Number of pages is 2, but the helper thinks that it is one because of wrong virt.

>
> ib_umem_dma_offset() is not always guaranteed to be zero, with a 0 iova.
> With higher order pages the offset can be within the page, it generates
>
> offset = IOVA % pgsz
>
> There are a couple places that do want the offset to be fixed to zero and have
> the loop, at this point it would be good to consolidate them into some
> common ib_umem_find_best_pgsz_zero_offset() or something.
>
> > > > +
> > > > + if (force_zero_offset) {
> > > > + while (ib_umem_dma_offset(umem, page_sz) && page_sz >
> > > > PAGE_SIZE)
> > > > + page_sz /= 2;
> > > > + if (ib_umem_dma_offset(umem, page_sz) != 0) {
> > > > + ibdev_dbg(&dev->ib_dev, "failed to find page
> > > > + size to
> > > > force zero offset.\n");
> > > > + return -ENOMEM;
> > > > + }
> > > > + }
> > > > +
>
> Yes this doesn't look quite right..
>
> It should flow from the HW capability, the helper you call should be tightly
> linked to what the HW can do.
>
> ib_umem_find_best_pgsz() is used for MRs that have the usual
> offset = IOVA % pgsz
>
> We've always created other helpers for other restrictions.
>
> So you should move your "force_zero_offset" into another helper and
> describe exactly how the HW works to support the calculation
>
> It is odd to have the offset loop and be using
> ib_umem_find_best_pgsz() with some iova, usually you'd use
> ib_umem_find_best_pgoff() in those cases, see the other callers.

Hi Jason,
Thanks for the comments.

To be honest, I do not understand how I could employ ib_umem_find_best_pgoff
for my purpose. As well as I do not see any mistake in the patch, and I think you neither.

I can explain the logic behind the code, and could you say what I need to change to get
this patch accepted?

Our HW cannot create a work queue or a completion queue from a dma region that has
non zero offset. As a result, applications must allocate at least 4K-aligned memory for
such queues. As a queue can be long, the helper functions may suggest large page sizes
(let's call it X), but compromise the zero offset. As we see that the offset is not zero, we
half X till first page size that can offer us zero offset. This page size will be at least 4K.
If we cannot find such page size, it means that the user did not provide 4K-aligned buffer.

I can make a special helper, but I do not think that it will be useful to anyone. Plus,
there is no better approach then halving the page size, so the helper will end up with that
loop under the hood. As I see mlnx also uses a loop with halving page_sz, but for a different
purpose, I do not see why our code cannot do the same without a special helper.

>
> Jason