Re: [RFC PATCH v3 28/37] WIP: bpf: Add fuse_ops struct_op programs

From: Andrii Nakryiko
Date: Wed May 03 2023 - 14:22:40 EST


On Tue, May 2, 2023 at 6:53 PM Daniel Rosenberg <drosen@xxxxxxxxxx> wrote:
>
> On Wed, Apr 26, 2023 at 9:18 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > Have you considered grouping this huge amount of callbacks into a
> > smaller set of more generic callbacks where each callback would get
> > enum argument specifying what sort of operation it is called for? This
> > has many advantages, starting from not having to deal with struct_ops
> > limits, ending with not needing to instantiate dozens of individual
> > BPF programs.
> >
> > E.g., for a lot of operations the difference between pre- and
> > post-filter is in having in argument as read-only and maybe having
> > extra out argument for post-filter. One way to unify such post/pre
> > filters into one callback would be to record whether in has to be
> > read-only or read-write and not allow to create r/w dynptr for the
> > former case. Pass bool or enum specifying if it's post or pre filter.
> > For that optional out argument, you can simulate effectively the same
> > by always supplying it, but making sure that out parameter is
> > read-only and zero-sized, for example.
> >
> > That would cut the number of callbacks in two, which I'd say still is
> > not great :) I think it would be better still to have even larger
> > groups of callbacks for whole families of operations with the same (or
> > "unifiable") interface (domain experts like you would need to do an
> > analysis here to see what makes sense to group, of course).
> >
> > We'll probably touch on that tomorrow at BPF office hours, but I
> > wanted to point this out beforehand, so that you have time to think
> > about it.
> >
>
> The meta info struct we pass in includes the opcode which contains
> whether it is a prefilter or postfilter, although I guess that may be
> less accessible to the verifier than a separate bool. In the v1
> version, we handled all op codes in a single program, although I think
> we were running into some slowdowns when we had every opcode in a
> giant switch statement, plus we were incurring the cost of the bpf
> program even when we didn't need to do anything in it. The struct_op
> version lets us entirely skip calling the bpf for opcodes we don't
> need to handle.
>
> Many of the arguments we pass currently are structs. If they were all
> dynptrs, we could set the output related ones to empty/readonly, but
> that removes one of the other strengths of the struct_op setup, where
> we can actually label the inputs as the structs they are instead of a
> void* equivalent. There are definitely some cases where we could
> easily merge opcode callbacks, like FUSE_FSYNCDIR/FUSE_FSYNC and
> FUSE_OPEN/FUSE_OPENDIR. I set them up as separate since it's easy to
> assign the same program to both callbacks in the case where you want
> both to be handled the same, while maintaining flexibility to handle
> them separately.

If combining hooks doesn't bring any value and simplification, I think
it's fine to keep it as is. I was mostly probing if there is an
equally convenient, but more succinct API that could be exposed
through struct_ops. If there is none, then it's fine.

>
> > +void bpf_fuse_get_rw_dynptr(struct fuse_buffer *buffer, struct bpf_dynptr_kern *dynptr__uninit, u64 size, bool copy)
> >
> > not clear why size is passed from outside instead of instantiating
> > dynptr with buffer->size? See [0] for bpf_dynptr_adjust and
> > bpf_dynptr_clone that allow you to adjust buffer as necessary.
> >
> > As for the copy parameter, can you elaborate on the idea behind it?
> >
> > [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=741584&state=*
> >
>
> We're storing these buffers as fuse_buffers initially because of the
> additional metadata we're carrying. Some fields have variable lengths,
> or are backed by const data. For instance, names. If you wanted to
> alter the name we use on the lower filesystem, you cannot change it
> directly since it's being backed by the dentry name. If you wanted to
> adjust something, like perhaps adding an extension, you would pass
> bpf_fuse_get_rw_dynptr the size you'd want for the new buffer, and
> copy=true to get the preexisting data. Fuse_buffer tracks that data
> was allocated so Fuse can clean up after the call. Additionally, say
> you wanted to trim half the data returned by an xattr for some reason.
> You would give it a size less than the buffer size to inform fuse that
> it should ignore the second half of the data. That part could be
> handled by bpf_dynptr_adjust if we didn't also need to handle the
> allocation case.

Interesting point about allocations and needing to realloc names. But
I wonder if it makes more sense to split the copy/reallocation part
and do it with separate kfunc. And leave dynptr only as means to work
with that data. So you'd do something like below for read/write case:

bpf_fuse_buf_clone(&buffer, new_size);
bpf_fuse_dynptr_from_buf_rw(&buffer, &dynptr);

But would skip bpf_fuse_buf_clone() if you only ever read:

bpf_fuse_dynptr_from_buf_ro(&buffer, &dynptr);

If fuse_buffer was never cloned/realloced, then
bpf_fuse_dynptr_from_buf_rw() should just fail and return invalid
dynptr.


> Say you wanted to have the lower file name be the hash of the one you
> created. In that case, you could get bpf_fuse_get_ro_dynptr to get
> access to compute the hash, and then bpf_fuse_get_rw_dynptr to get a
> buffer to write the hash to. Since the data is not directly related to
> the original data, there would be no benefit to getting a copy.
>
> I initially intended for bpf_fuse_get_ro_dynptr/bpf_fuse_get_rw_dynptr
> to be called at most once for each field, but that may be too
> restrictive. At the moment, if you make two calls that require
> reallocating, any pointers to the old buffer would be invalid. This is
> not the case for the original name, as we aren't touching the original
> source. There are two possible approaches here. I could either
> refcount the buffer and have a put kfunc, or I could invalidate old
> dynpointers when bpf_fuse_get_rw_dynptr is called, similar to what
> skb/xdp do. I'm leaning towards the latter to disallow having many
> allocations active at once by calling bpf_fuse_get_rw_dynptr for
> increasing sizes, though I could also just disallow reallocating a
> buffer that already was reallocated.

Yes, invalidating dynptrs sounds like a way to go. But I think instead
of bundling all that into dynptr constructor for fuse_buffer, it's
better to have a separate kfunc that would be doing realloc/cloning
*and* invalidating. Other than that, neither from_buf_rw nor
from_buf_ro should be doing invalidation, because they can't cause
realloc. WDYT?

>
> The new dynptr helpers are pretty exciting since they'll make it much
> easier to deal with chunks of data, which we may end up doing in
> read/write filters. I haven't fully set those up since I was waiting
> to see what the dynptr helpers ended up looking like.
>

Great, let us know how it goes in practice to start using them.

>
> > > +void bpf_fuse_get_ro_dynptr(const struct fuse_buffer *buffer, struct bpf_dynptr_kern *dynptr__uninit)
> >
> > these kfuncs probably should be more consistently named as
> > bpf_dynptr_from_fuse_buffer_{ro,rw}() ?
> >
> Yeah, that fits in much better with the skb/xdp functions.

great

>
> > > +
> > > +uint32_t bpf_fuse_return_len(struct fuse_buffer *buffer)
> > > +{
> > > + return buffer->size;
> >
> > you should be able to get this with bpf_dynptr_size() (once you create
> > it from fuse_buffer).
> >
>
> Yes, this might be unnecessary. I added it while testing kfuncs, and
> had intended to use it with a fuse_buffer strncmp before I saw that
> there's now a bpf_strncmp :) I had tried using it with
> bpf_dynptr_slice, but that requires a known constant at verification
> time, which may make using it in real cases a bit difficult...
> bpf_strncmp also has some restrictions around the second string being
> a fixed map, or something like that.

right, we might need a more flexible strncmp version working with two
dynptrs and not assuming a fixed string. We didn't have dynptr
abstraction for working with variable-sized memory when we were adding
bpf_strncmp.

>
> >
> > you probably should be fine with just using bpf_tracing_func_proto as is
> >
> > > + .is_valid_access = bpf_fuse_is_valid_access,
> >
> > similarly, why custom no-op callback?
> >
>
> Those are largely carried over from iterations when I was less sure
> what I would need. A lot of the work I was doing in the v1 code is
> handled by default with the struct_op setup now, or is otherwise
> unnecessary. This area in particular needs a lot of cleanup.
>

ok

> > > +static int bpf_fuse_init(struct btf *btf)
> > > +{
> > > + s32 type_id;
> > > +
> > > + type_id = btf_find_by_name_kind(btf, "fuse_buffer", BTF_KIND_STRUCT);
> > > + if (type_id < 0)
> > > + return -EINVAL;
> > > + fuse_buffer_struct_type = btf_type_by_id(btf, type_id);
> > > +
> >
> > see BTF_ID and BTF_ID_LIST uses for how to get ID for your custom
> > well-known type
> >
> Thanks, I'll look into those.