RE: [PATCH for-next v4 0/8] On-Demand Paging on SoftRoCE

From: Daisuke Matsuda (Fujitsu)
Date: Wed Apr 19 2023 - 20:29:06 EST


On Thu, April 20, 2023 1:07 AM Pearson, Robert B wrote:
>
> The work queue patch has been submitted and is waiting for some action. -- Bob

Hi,
Could you tell me which one it is? I am willing to review it.

This seems to be your latest work queue patch:
https://lore.kernel.org/all/TYCPR01MB8455A2D0B3303FD90B3BB6F1E58B9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
I cannot find a newer one on the mailing list or on Patchwork.

Daisuke

>
> -----Original Message-----
> From: Daisuke Matsuda <matsuda-daisuke@xxxxxxxxxxx>
> Sent: Wednesday, April 19, 2023 12:52 AM
> To: linux-rdma@xxxxxxxxxxxxxxx; leonro@xxxxxxxxxx; jgg@xxxxxxxxxx; zyjzyj2000@xxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx; rpearsonhpe@xxxxxxxxx; yangx.jy@xxxxxxxxxxx; lizhijian@xxxxxxxxxxx; Daisuke
> Matsuda <matsuda-daisuke@xxxxxxxxxxx>
> Subject: [PATCH for-next v4 0/8] On-Demand Paging on SoftRoCE
>
> This patch series implements the On-Demand Paging feature in the SoftRoCE (rxe) driver. Until now, the feature has been
> available only in the mlx5 driver[1].
>
> The first patch of this series is provided for testing purposes only and should be dropped in the end. It converts the three
> tasklets to use a workqueue so that they can sleep during a page fault. Bob Pearson says he will post a patch to do this, and I
> think we can adopt that. The other patches in this series are, I believe, complete.
>
> For brevity, I have omitted some content, such as the motivation behind this series.
> Please see the cover letter of v3 for more details[2].
>
> [Overview]
> When applications register a memory region (MR), RDMA drivers normally pin the pages in the MR so that their physical
> addresses never change during RDMA communication. This requires the MR to fit in physical memory and inevitably leads to
> memory pressure. On-Demand Paging (ODP), on the other hand, allows applications to register MRs without pinning pages.
> Pages are paged in when the driver needs them and paged out when the OS reclaims them. As a result, it is possible to
> register a large MR that does not fit in physical memory without consuming much physical memory.
>
> [How does ODP work?]
> "struct ib_umem_odp" is used to manage pages. One is created for each ODP-enabled MR when the MR is registered. This struct
> holds a pair of arrays (dma_list/pfn_list) that serve as a driver page table, storing DMA addresses and PFNs. They
> are updated on page-in and page-out, both of which use the common interfaces in the ib_uverbs layer.
>
> Page-in can occur when the requester, responder, or completer accesses an MR in order to process RDMA operations. If it
> finds that the pages being accessed are not present in physical memory, or that the requisite permissions are not set on
> the pages, it triggers a page fault to make the pages present with proper permissions and, at the same time, update the
> driver page table. After confirming the presence of the pages, it executes the memory access, such as a read, write, or
> atomic operation.
>
> Page-out is triggered by page reclaim or filesystem events (e.g. a metadata update of a file that is being used as an MR).
> When creating an ODP-enabled MR, the driver registers an MMU notifier callback. When the kernel issues a page
> invalidation notification, the callback is invoked to unmap DMA addresses and update the driver page table. After that,
> the kernel releases the pages.
>
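The page-in and page-out flow described above can be illustrated with a small userspace toy model. This is only a sketch
under stated assumptions, not code from the rxe driver or from this series: every name here (toy_odp_mr, toy_access,
toy_page_fault, toy_invalidate_range) is hypothetical, the "DMA addresses" are made-up numbers, and a single flags array
stands in for the information the real dma_list/pfn_list pair carries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NPAGES   8
#define TOY_PRESENT  0x1u
#define TOY_WRITABLE 0x2u

/* Hypothetical stand-in for a per-MR driver page table. */
struct toy_odp_mr {
	uint64_t dma_list[TOY_NPAGES];   /* 0 means "not mapped" */
	unsigned int flags[TOY_NPAGES];  /* TOY_PRESENT / TOY_WRITABLE */
	unsigned int faults;             /* number of simulated page faults */
};

/* Page-in: install a fake DMA address and record the granted permissions. */
static void toy_page_fault(struct toy_odp_mr *mr, size_t idx, bool write)
{
	mr->dma_list[idx] = 0x100000u + (uint64_t)idx * 4096u;
	mr->flags[idx] = TOY_PRESENT | (write ? TOY_WRITABLE : 0u);
	mr->faults++;
}

/*
 * Access path: fault only if the page is absent or lacks write permission,
 * then return the (fake) DMA address to use for the memory access.
 */
static uint64_t toy_access(struct toy_odp_mr *mr, size_t idx, bool write)
{
	bool ok;

	assert(idx < TOY_NPAGES);
	ok = (mr->flags[idx] & TOY_PRESENT) &&
	     (!write || (mr->flags[idx] & TOY_WRITABLE));
	if (!ok)
		toy_page_fault(mr, idx, write);
	return mr->dma_list[idx];
}

/* Page-out: what an invalidation callback would do for pages [start, end). */
static void toy_invalidate_range(struct toy_odp_mr *mr, size_t start, size_t end)
{
	for (size_t i = start; i < end && i < TOY_NPAGES; i++) {
		mr->dma_list[i] = 0;
		mr->flags[i] = 0;
	}
}
```

In this model, a first read of a page faults once, a repeated read does not fault, a later write to the same page faults
again to upgrade the permission, and any access after toy_invalidate_range() covers the page faults once more. The real
driver additionally has to serialize these paths against each other, which the toy model ignores.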
> [Supported operations]
> All traditional operations are supported on RC connections. The new Atomic write[3] and RDMA Flush[4] operations are
> not included in this patchset. I will post them later after this patchset is merged. On UD connections, Send, Recv, and
> SRQ-Recv are supported.
>
> [How to test ODP?]
> There are only a few resources available for testing. The pyverbs testcases in rdma-core and perftest[5] are the
> recommended ones. In addition, the ibv_rc_pingpong command can also be used for testing. Note that you may have to build
> perftest from upstream because older versions do not handle ODP capabilities correctly.
>
> The tree is available on GitHub:
> https://github.com/daimatsuda/linux/tree/odp_v4
> While this series is based on commit f605f26ea196, the tree includes an additional bugfix, which is yet to be merged as of
> today (Apr 19th, 2023).
> https://lore.kernel.org/linux-rdma/20230418090642.1849358-1-matsuda-daisuke@xxxxxxxxxxx/
>
> [Future work]
> My next task is to enable the new Atomic write[3] and RDMA Flush[4] operations with ODP. After that, I am going to
> implement the prefetch feature, which allows applications to trigger page faults using ibv_advise_mr(3) to optimize
> performance. Some existing software, such as librpma[6], uses this feature. Additionally, I think we can also add the
> implicit ODP feature in the future.
>
> [1] [RFC 00/20] On demand paging
> https://www.spinics.net/lists/linux-rdma/msg18906.html
>
> [2] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
> https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@xxxxxxxxxxx/
>
> [3] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
> https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@xxxxxxxxxxx/
>
> [4] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
> https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@xxxxxxxxxxx/
>
> [5] linux-rdma/perftest: Infiniband Verbs Performance Tests
> https://github.com/linux-rdma/perftest
>
> [6] librpma: Remote Persistent Memory Access Library
> https://github.com/pmem/rpma
>
> v3->v4:
> 1) Re-designed functions that access MRs to use the MR xarray.
> 2) Rebased onto the latest jgg-for-next tree.
>
> v2->v3:
> 1) Removed a patch that changes the common ib_uverbs layer.
> 2) Re-implemented patches for conversion to workqueue.
> 3) Fixed compile errors (which occurred when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
> 4) Fixed some functions that returned incorrect errors.
> 5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
>
> v1->v2:
> 1) Fixed a crash issue reported by Haris Iqbal.
> 2) Tried to make lock patterns clearer as pointed out by Romanovsky.
> 3) Minor cleanups and fixes.
>
> Daisuke Matsuda (8):
> RDMA/rxe: Tentative workqueue implementation
> RDMA/rxe: Always schedule works before accessing user MRs
> RDMA/rxe: Make MR functions accessible from other rxe source code
> RDMA/rxe: Move resp_states definition to rxe_verbs.h
> RDMA/rxe: Add page invalidation support
> RDMA/rxe: Allow registering MRs for On-Demand Paging
> RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
> RDMA/rxe: Add support for the traditional Atomic operations with ODP
>
> drivers/infiniband/sw/rxe/Makefile | 2 +
> drivers/infiniband/sw/rxe/rxe.c | 27 ++-
> drivers/infiniband/sw/rxe/rxe.h | 37 ---
> drivers/infiniband/sw/rxe/rxe_comp.c | 12 +-
> drivers/infiniband/sw/rxe/rxe_loc.h | 49 +++-
> drivers/infiniband/sw/rxe/rxe_mr.c | 27 +--
> drivers/infiniband/sw/rxe/rxe_odp.c | 311 ++++++++++++++++++++++++++
> drivers/infiniband/sw/rxe/rxe_recv.c | 4 +-
> drivers/infiniband/sw/rxe/rxe_resp.c | 32 ++-
> drivers/infiniband/sw/rxe/rxe_task.c | 84 ++++---
> drivers/infiniband/sw/rxe/rxe_task.h | 6 +-
> drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
> drivers/infiniband/sw/rxe/rxe_verbs.h | 39 ++++
> 13 files changed, 535 insertions(+), 100 deletions(-)
> create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>
> base-commit: f605f26ea196a3b49bea249330cbd18dba61a33e
>
> --
> 2.39.1