Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit

From: Bernd Schubert
Date: Thu Mar 28 2024 - 18:20:27 EST




On 3/28/24 17:46, Sweet Tea Dorminy wrote:
>
>
> On 3/7/24 17:06, Bernd Schubert wrote:
>> Hi Jingbo,
>>
>> On 3/7/24 03:16, Jingbo Xu wrote:
>>> Hi Bernd,
>>>
>>> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>>>
>>>>
>>>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu
>>>>>> <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi Miklos,
>>>>>>>
>>>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu
>>>>>>>>>> <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> From: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>>>>
>>>>>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data
>>>>>>>>>>> size of a
>>>>>>>>>>> single request is increased.
>>>>>>>>>>
>>>>>>>>>> The only worry is about where this memory is getting accounted
>>>>>>>>>> to.
>>>>>>>>>> This needs to be thought through, since we are increasing the
>>>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>>>
>>>>>>>> Apart from the request size, the maximum number of background
>>>>>>>> requests,
>>>>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>>>>> daemon), also limits the size of the memory that an unprivileged
>>>>>>>> user
>>>>>>>> can pin.  But yes, it indeed increases the number proportionally by
>>>>>>>> increasing the maximum request size.
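
As a rough illustration of the proportional growth described above, here
is a minimal sketch. It assumes that only the payload pages of background
requests count, with the default max_background of 12 and a 4096-byte
PAGE_SIZE as quoted in the thread. The function names are hypothetical,
and foreground requests and any other allocations are ignored.

```python
PAGE_SIZE = 4096          # bytes, as in the patch description
DEFAULT_MAX_BACKGROUND = 12  # default quoted in the thread


def max_request_bytes(max_pages, page_size=PAGE_SIZE):
    """Largest data payload of a single request, in bytes."""
    return max_pages * page_size


def background_pinned_bytes(max_pages,
                            max_background=DEFAULT_MAX_BACKGROUND,
                            page_size=PAGE_SIZE):
    """Rough upper bound on payload pages held by background requests."""
    return max_background * max_request_bytes(max_pages, page_size)


if __name__ == "__main__":
    for pages in (256, 1024):
        req = max_request_bytes(pages)
        pinned = background_pinned_bytes(pages)
        print(f"{pages} pages: {req >> 20} MiB per request, "
              f"{pinned >> 20} MiB across {DEFAULT_MAX_BACKGROUND} "
              f"background requests")
```

With these assumptions, going from 256 to 1024 pages raises the bound
from 12 MiB to 48 MiB, and a daemon that configures a larger
max_background scales it further.
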
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This optimizes the write performance especially when the
>>>>>>>>>>> optimal IO size
>>>>>>>>>>> of the backend store at the fuse daemon side is greater than
>>>>>>>>>>> the original
>>>>>>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>>>>>>> 4096 PAGE_SIZE).
>>>>>>>>>>>
>>>>>>>>>>> Note that this only increases the upper limit of the
>>>>>>>>>>> maximum request
>>>>>>>>>>> size, while the real maximum request size relies on the
>>>>>>>>>>> FUSE_INIT
>>>>>>>>>>> negotiation with the fuse daemon.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@xxxxxxxxxxxxxxx>
>>>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx>
>>>>>>>>>>> ---
>>>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>>>> Bytedance folks seem to have increased the maximum request
>>>>>>>>>>> size to 8M
>>>>>>>>>>> and saw a ~20% performance boost.
>>>>>>>>>>
>>>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>>>
>>>>>>>>> Yeah I guess so.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> It would be interesting to
>>>>>>>>>> see how the number of pages per request affects
>>>>>>>>>> performance and
>>>>>>>>>> why.
>>>>>>>>>
>>>>>>>>> To be honest, I'm not sure about the root cause of the
>>>>>>>>> performance boost in Bytedance's case.
>>>>>>>>>
>>>>>>>>> In our internal use scenario, the optimal IO size of the backend
>>>>>>>>> store at the fuse server side is, e.g. 4MB, and thus the maximum
>>>>>>>>> throughput cannot be achieved with the current 256 pages per
>>>>>>>>> request. IOW the backend store, e.g. a distributed parallel
>>>>>>>>> filesystem, gets optimal performance when the data is aligned at
>>>>>>>>> a 4MB boundary.  I can ask my colleague who implements the fuse
>>>>>>>>> server to give more background info and the exact performance
>>>>>>>>> statistics.
>>>>>>>>
>>>>>>>> Here are more details about our internal use case:
>>>>>>>>
>>>>>>>> We have a fuse server used in our internal cloud scenarios,
>>>>>>>> while the
>>>>>>>> backend store is actually a distributed filesystem.  That is,
>>>>>>>> the fuse
>>>>>>>> server actually plays as the client of the remote distributed
>>>>>>>> filesystem.  The fuse server forwards the fuse requests to the
>>>>>>>> remote
>>>>>>>> backing store through network, while the remote distributed
>>>>>>>> filesystem
>>>>>>>> handles the IO requests, e.g. process the data from/to the
>>>>>>>> persistent store.
>>>>>>>>
>>>>>>>> Here are the details of how the remote distributed filesystem
>>>>>>>> processes the requested data with the persistent store.
>>>>>>>>
>>>>>>>> [1] The remote distributed filesystem uses, e.g. an 8+3 mode, EC
>>>>>>>> (ErasureCode), where each fixed-size chunk of user data is split
>>>>>>>> and stored as 8
>>>>>>>> data blocks plus 3 extra parity blocks. For example, with 512 bytes
>>>>>>>> block size, for each 4MB user data, it's split and stored as 8 (512
>>>>>>>> bytes) data blocks with 3 (512 bytes) parity blocks.
>>>>>>>>
>>>>>>>> It also uses striping to boost performance, for
>>>>>>>> example, there are 8 data disks and 3 parity disks in the above
>>>>>>>> 8+3 mode
>>>>>>>> example, in which each stripe consists of 8 data blocks and 3
>>>>>>>> parity
>>>>>>>> blocks.
>>>>>>>>
>>>>>>>> [2] To avoid data corruption on power off, the remote distributed
>>>>>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>>>>>> request is received.  Because of the EC described above, when the
>>>>>>>> write fuse request is not aligned on a 4MB (the stripe size)
>>>>>>>> boundary, say it's 1MB in size, the other 3MB is read from the
>>>>>>>> persistent store first, then the extra 3 parity blocks are
>>>>>>>> computed over the complete 4MB stripe, and finally the 8 data
>>>>>>>> blocks and 3 parity blocks are written down.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thus the write amplification is non-negligible and is the
>>>>>>>> performance
>>>>>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>>>>>
>>>>>>>> Here are some simple performance statistics with varying request
>>>>>>>> size.
>>>>>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the
>>>>>>>> maximum
>>>>>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>>>>>> improvement when the request size is increased to 4MB from 3.9MB.
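
The read-modify-write cost described in [2] can be sketched with a
simplified model. It assumes that every request is committed on its own
and that any stripe not fully covered by a request is first read back and
then rewritten in full, parity included. The function names and the
byte-accounting are illustrative assumptions, not the actual backend
implementation.

```python
MIB = 1 << 20


def backend_io(offset, length, stripe=4 * MIB, data_blocks=8, parity_blocks=3):
    """Return (bytes_read, bytes_written) at the backend for one write.

    Each touched stripe is rewritten completely (data plus parity); the
    part of the stripe that the request does not cover is read first.
    """
    read = written = 0
    pos, end = offset, offset + length
    while pos < end:
        stripe_start = pos - pos % stripe
        chunk_end = min(end, stripe_start + stripe)
        covered = chunk_end - pos
        read += stripe - covered
        written += stripe * (data_blocks + parity_blocks) // data_blocks
        pos = chunk_end
    return read, written


def total_io(request_size, total=4 * MIB):
    """Backend traffic for writing `total` bytes in requests of `request_size`."""
    reads = writes = 0
    for off in range(0, total, request_size):
        r, w = backend_io(off, min(request_size, total - off))
        reads += r
        writes += w
    return reads, writes


if __name__ == "__main__":
    for size in (1 * MIB, 4 * MIB):
        r, w = total_io(size)
        print(f"{size // MIB} MiB requests: read {r / MIB:.1f} MiB, "
              f"write {w / MIB:.1f} MiB")
```

Under this model, filling one 4MB stripe with four 1MB requests costs
12 MiB of reads and 22 MiB of writes, while a single stripe-sized request
costs no reads and 5.5 MiB of writes.
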
>>>>>>
>>>>>> I sort of understand the issue, although my guess is that this could
>>>>>> be worked around in the client by coalescing writes.  This could be
>>>>>> done by adding a small delay before sending a write request off to
>>>>>> the
>>>>>> network.
>>>>>>
>>>>>> Would that work in your case?
>>>>>
>>>>> It's possible but I'm not sure. I've asked my colleagues who are
>>>>> working on the fuse server and the backend store, though they have
>>>>> not replied yet.  But I guess it's not as simple as increasing the
>>>>> maximum FUSE request size directly, and thus more complexity gets
>>>>> involved.
>>>>>
>>>>> I can also understand the concern that this may increase the risk of
>>>>> pinning more memory, and a more generic usage scenario needs
>>>>> to be considered.  I can make it a private patch for our internal
>>>>> product.
>>>>>
>>>>> Thanks for the suggestions and discussion.
>>>>
>>>> It also gets kind of solved in my fuse-over-io-uring branch - as
>>>> long as
>>>> there are enough free ring entries. I'm going to add in a flag there
>>>> that other CQEs might be follow up requests. Really time to post a new
>>>> version.
>>>
>>> Thanks for the information.  I've not read the fuse-over-io-uring branch
>>> yet, but it sounds like it would be very helpful.  Would there be a flag
>>> in the FUSE request indicating it's one of the linked FUSE requests?  Is
>>> this feature, say linked FUSE requests, enabled only when io-uring is
>>> upon FUSE?
>>
>>
>> Current development branch is this
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8
>> (It sometimes gets rebase/force pushes and incompatible changes - the
>> corresponding libfuse branch is also persistently updated).
>>
>> Patches need clean up before I can send the next RFC version. And I
>> first want to change the fixed single request size (not so nice to use 1MB
>> requests when 4K would be sufficient, for things like metadata and small
>> IO).
>>
>
> Let me know if there's something you'd like collaboration on --
> fuse_iouring sounds very exciting and I'd love to help out any way that
> would be useful.

With pleasure, I'll take whatever help you offer. Right now I'm jumping
between different projects quite a bit, and I'm not too happy that I
still haven't sent out a new patch version. (The atomic-open
branch also needs updates.)

>
> For our internal usecase at Meta, the relevant backend store operates on
> 8M chunks, so I'm also very interested in the simplicity of just opting
> in to receiving 8M IOs from the kernel instead of needing to buffer our
> own 8MB IOs. But io_uring does seem like a plausible general-purpose
> improvement too, so either or both of these paths is interesting and I'm
> working on gathering performance numbers on the relative merits.

Merging requests requires a bit of scanning through the CQEs on the
userspace side, as they all arrive in random order. I haven't even tried
to merge requests yet; I have just seen while debugging that the ring
queue gets filled with requests that belong together.
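
A minimal sketch of what such a userspace merge pass could look like,
assuming each harvested write request can be reduced to a
(nodeid, offset, length) tuple. That tuple layout and the function name
are hypothetical and not part of the fuse-uring or libfuse interface.
Overlapping writes, ordering guarantees and the data buffers themselves
are left out.

```python
from collections import defaultdict


def merge_contiguous(requests):
    """Coalesce write requests that arrived in arbitrary order.

    `requests` is an iterable of (nodeid, offset, length) tuples.
    Returns a list of merged (nodeid, offset, length) extents, where
    requests on the same node that are byte-adjacent become one extent.
    """
    by_node = defaultdict(list)
    for nodeid, offset, length in requests:
        by_node[nodeid].append((offset, length))

    merged = []
    for nodeid, extents in by_node.items():
        extents.sort()  # arrival order is random, so order by offset
        cur_off, cur_len = extents[0]
        for off, length in extents[1:]:
            if off == cur_off + cur_len:
                cur_len += length  # directly follows the current extent
            else:
                merged.append((nodeid, cur_off, cur_len))
                cur_off, cur_len = off, length
        merged.append((nodeid, cur_off, cur_len))
    return merged


if __name__ == "__main__":
    MIB = 1 << 20
    arrived = [(1, 2 * MIB, MIB), (1, 0, MIB), (2, 0, 4096),
               (1, MIB, MIB), (1, 8 * MIB, MIB)]
    for extent in merge_contiguous(arrived):
        print(extent)
```

In this example the three adjacent 1MB requests on node 1 collapse into
one 3MB extent, while the request at 8MB and the small request on node 2
stay separate.
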

Out of interest, are you using libfuse or your own kernel interface
library? I would be quite interested to know if the fuse-uring
kernel/userspace interface, and then the libfuse interface, match your
needs. For example, our next-gen DDN file system runs in spdk reactor
context and I had to update our own code base and libfuse to support
ring polling. So that is one project outside of libfuse example/ and it
already needed some changes...
Another change I haven't implemented yet in libfuse is ring request
buffer registration with the file system (for network rdma).

Btw, I just ran into a bug that came up with FUSE_CAP_WRITEBACK_CACHE - I
definitely don't claim that all code paths are perfectly tested already
(it is fixed now in the fuse-uring-for-6.8 branch).


Thanks,
Bernd