Re: [PATCH v5 02/10] block: Add copy offload support infrastructure

From: Nitesh Shetty
Date: Tue Dec 13 2022 - 00:17:13 EST


On Wed, Dec 07, 2022 at 07:19:00PM +0800, Ming Lei wrote:
> On Wed, Dec 07, 2022 at 11:24:00AM +0530, Nitesh Shetty wrote:
> > On Tue, Nov 29, 2022 at 05:14:28PM +0530, Nitesh Shetty wrote:
> > > On Thu, Nov 24, 2022 at 08:03:56AM +0800, Ming Lei wrote:
> > > > On Wed, Nov 23, 2022 at 03:37:12PM +0530, Nitesh Shetty wrote:
> > > > > On Wed, Nov 23, 2022 at 04:04:18PM +0800, Ming Lei wrote:
> > > > > > On Wed, Nov 23, 2022 at 11:28:19AM +0530, Nitesh Shetty wrote:
> > > > > > > Introduce blkdev_issue_copy which supports source and destination bdevs,
> > > > > > > and an array of (source, destination and copy length) tuples.
> > > > > > > Introduce REQ_COPY copy offload operation flag. Create a read-write
> > > > > > > bio pair with a token as payload and submitted to the device in order.
> > > > > > > Read request populates token with source specific information which
> > > > > > > is then passed with write request.
> > > > > > > This design is courtesy Mikulas Patocka's token based copy
> > > > > >
> > > > > > I thought this patchset is just for enabling copy command which is
> > > > > > supported by hardware. But turns out it isn't, because blk_copy_offload()
> > > > > > still submits read/write bios for doing the copy.
> > > > > >
> > > > > > I am just wondering why not let copy_file_range() cover this kind of copy,
> > > > > > and the framework has been there.
> > > > > >
> > > > >
> > > > > Main goal was to enable copy command, but community suggested to add
> > > > > copy emulation as well.
> > > > >
> > > > > blk_copy_offload - actually issues copy command in driver layer.
> > > > > The way read/write BIOs are percieved is different for copy offload.
> > > > > In copy offload we check REQ_COPY flag in NVMe driver layer to issue
> > > > > copy command. But we did missed it to add in other driver's, where they
> > > > > might be treated as normal READ/WRITE.
> > > > >
> > > > > blk_copy_emulate - is used if we fail or if device doesn't support native
> > > > > copy offload command. Here we do READ/WRITE. Using copy_file_range for
> > > > > emulation might be possible, but we see 2 issues here.
> > > > > 1. We explored possibility of pulling dm-kcopyd to block layer so that we
> > > > > can readily use it. But we found it had many dependecies from dm-layer.
> > > > > So later dropped that idea.
> > > >
> > > > Is it just because dm-kcopyd supports async copy? If yes, I believe we
> > > > can reply on io_uring for implementing async copy_file_range, which will
> > > > be generic interface for async copy, and could get better perf.
> > > >
> > >
> > > It supports both sync and async. But used only inside dm-layer.
> > > Async version of copy_file_range can help, using io-uring can be helpful
> > > for user , but in-kernel users can't use uring.
> > >
> > > > > 2. copy_file_range, for block device atleast we saw few check's which fail
> > > > > it for raw block device. At this point I dont know much about the history of
> > > > > why such check is present.
> > > >
> > > > Got it, but IMO the check in generic_copy_file_checks() can be
> > > > relaxed to cover blkdev cause splice does support blkdev.
> > > >
> > > > Then your bdev offload copy work can be simplified into:
> > > >
> > > > 1) implement .copy_file_range for def_blk_fops, suppose it is
> > > > blkdev_copy_file_range()
> > > >
> > > > 2) inside blkdev_copy_file_range()
> > > >
> > > > - if the bdev supports offload copy, just submit one bio to the device,
> > > > and this will be converted to one pt req to device
> > > >
> > > > - otherwise, fallback to generic_copy_file_range()
> > > >
> > >
> >
> > Actually we sent initial version with single bio, but later community
> > suggested two bio's is must for offload, main reasoning being
>
> Is there any link which holds the discussion?
>

This[1] is the starting thread for LSF/MM which Chaitanya organized and it
was a conference call. Also in 2022 LSM/MM there was a discussion on copy
topic as well. One of the main conclusion was using 2 bio's as must have.

Bart has summarized previous copy efforts[2] and Martin too [3],
which might be of help in understanding why 2 bio's are must.

> > dm-layer,Xcopy,copy across namespace compatibilty.
>
> But dm kcopy has supported bdev copy already, so once your patch is
> ready, dm kcopy can just sends one bio with REQ_COPY if the device
> supports offload command, otherwise the current dm kcopy code can work
> as before.
>
> >
> > > We will check the feasibilty and try to implement the scheme in next versions.
> > > It would be helpful, if someone in community know's why such checks were
> > > present ? We see copy_file_range accepts only regular file. Was it
> > > designed only for regular files or can we extend it to regular block
> > > device.
> > >
> >
> > As you suggested we were able to integrate def_blk_ops and
> > run with user application, but we see one main issue with this approach.
> > Using blkdev_copy_file_range requires having 2 file descriptors, which
> > is not possible for in kernel users such as fabrics/dm-kcopyd which has
> > only bdev descriptors.
> > Do you have any plumbing suggestions here ?
>
> What is the fabrics kernel user?

In fabrics setup we are exposing target side NVMe device, to host as
copy offload capable device, irrespective of target device's capability.
This will enable sending copy offload command from host.
>From target side, if device doesn't support offload then we emulate.
This way we will avoid data copy over the network.

> Any kernel target code(such as nvme target)
> has file/bdev path available, vfs_copy_file_range() should be fine.
>

>From target side, fabrics device can be created in 2 ways,
1. file backed device: In this setup we get file descriptor. In fact in our
patches[4] are using vfs_copy_file_range to offload copy.
2. block backed device: Here we have only blockdev descriptor.
This setup we can't use vfs_copy_file_range since we are missing file
descriptor.
Your input on how we can plumb blkdev_copy_file_range with bdev would help.

> Also IMO, kernel copy user shouldn't be important long term, especially if
> io_uring copy_file_range() can be supported, forwarding to userspace not
> only gets better performance, but also cleanup kernel related copy code
> much.
>

Using uring is valid if consumer of copy_file_range is userspace.
But there are many possible in-kernel user of copy, such as dm-kcopyd,
garbage collection case such as F2FS GC, also fabrics as explained above.
We can't use uring from these layers.

Present implementation of vfs_copy_file_range(CFR) lacks
a. async implementation, which helps in nvme over fabrics
b. multi range(src,dst pair) support
So here we see going to blkdev_copy_file_range actually regress our present
result. But we do see a goodness of using mature vfs_copy_file_range for
emulation.

[1] https://lore.kernel.org/linux-nvme/BYAPR04MB49652C4B75E38F3716F3C06386539@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[2] https://github.com/bvanassche/linux-kernel-copy-offload
[3] https://oss.oracle.com/~mkp/docs/xcopy.pdf
[4] https://lore.kernel.org/lkml/20221130041450.GA17533@test-zns/T/#Z2e.:..:20221123055827.26996-7-nj.shetty::40samsung.com:1drivers:nvme:target:io-cmd-file.c

Thanks,
Nitesh Shetty