Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

From: Eric W. Biederman
Date: Tue Oct 20 2015 - 15:05:06 EST


Alexei Starovoitov <ast@xxxxxxxxxxxx> writes:

> On 10/20/15 1:46 AM, Daniel Borkmann wrote:
>>> as we discussed in this thread and earlier during plumbers I think
>>> it would be good to expose key/values somehow in this fs.
>>> 'how' is a big question.
>>
>> Yes, it is a big question, and probably best left to the domain-specific
>> application itself, which can already dump the map nowadays via bpf(2)
>> syscall. You can add bindings to various languages to make it available
>> elsewhere as well.
>>
>> Or, you have a user space 'bpf' tool that can connect to any map that is
>> being exposed with whatever model, and have modular pretty printers in
>> user space somewhere located as shared objects, they could get auto-loaded
>> in the background. Maps could get an annotation attached as an attribute
>> during creation that is being exposed somewhere, so it can be mapped to
>> a pretty printer shared object. This would better be solved in user space
>> entirely, in my opinion, why should the kernel add complexity for this
>> when this is so much user-space application specific anyway?
>>
>> As we all agreed, looking into key/values via shell is a rare event and
>> not needed most of the time. It comes with its own problems (f.e. think
>> of dumping a possible rhashtable map with key/values as files). But even
>> if we'd want to stick this into files by all means, fusefs can do this
>> specific job entirely in user space _plus_ fetching these shared objects
>> for pretty printers etc, all we need for this is to add this annotation/
>> mapping attribute somewhere to bpf_maps and that's all it takes.
>>
>> This question is no doubt independent of the fd pinning mechanism, but as
>> I said, I don't think sticking this into the kernel is a good idea. Why
>> would that be the kernel's job?
>
> agree with all of the concerns above. I said it would be good for
> kernel to expose key/values and I still think it would be a useful
> feature. Regardless of whether the kernel does it or not in the future,
> the point was 'IF we want kernel to do it then bpf FS is the right way'.
>
>> In the other email, you are mentioning fdinfo. fdinfo can be done for any
>> map/prog already today by just adding the right .show_fdinfo() callback to
>> bpf_map_fops and bpf_prog_fops, so we let the anon-inodes that we already
>> use today to do this job for free and such debugging info can be inspected
>> through procfs already. This is common practice, f.e. look at timerfd,
>> signalfd and others.
>
> I know. That's exactly what I proposed, but again the point was
> that fdinfo of regular FDs should match in style to pinned FDs,
> 'cat /sys/kernel/bpf/.../map5' should be similar to
> 'cat /proc/.../fdinfo/5'
> and 'cat /sys/kernel/bpf...' you can only cleanly do with bpffs.
>
>>> But regardless which path we take, sysfs is too rigid.
>>> For the sake of argument say we do every key as a new file in bpffs.
>>> It's not very scalable, but compared to sysfs it's better
>>> (resource wise).
>>
>> I doubt this is scalable at all, no matter if it's sysfs or our own custom
>> fs. How should that work? You have a map with possibly thousands or
>> millions of entries. Are these files to be generated on the fly like in
>> procfs as soon as you enter that directory? Or as a one-time snapshot (but
>> then the user might want to create various snapshots)? There might be new
>> map elements as building blocks in the future such as pipes, ring buffers
>> etc. How are they being dumped as files?
>
> you're arguing that keys as files are not scalable. sure.
> See what I said above "it's not very scalable"
> The point is that the fs approach is more flexible compared to cdev.
>
>>> not everything in unix is a model that should be followed.
>>> af_unix with name[0]!=0 is a bad api that wasn't thought through.
>>> Thankfully Linux improved it with abstract names that don't use
>>> special files.
>>> bpf maps obviously is not an IPC (either pinned or not).
>>
>> So, if this pinning facility is unprivileged and available for *all*
>> applications, then applications can in fact use eBPF maps (w/o any
>> other aids such as Unix domain sockets to transfer fds) among themselves
>> to exchange state via bpf(2) syscall. It doesn't need a corresponding
>> program.
>
> Obviously I know that, but it doesn't make it an IPC.
> Just because two processes can talk to each other via normal TCP/IP, it
> doesn't make TCP/IP an IPC mechanism.
> The point is "just because two processes can communicate with each
> other via X (bpf maps) we are not going to optimize (or make
> architectural decisions in X) just for this use case". It's a job of
> generic IPC and we have enough of them already.
>
>> Okay, sure, but then having a mount_single() and separating users and
>> namespaces is still not being resolved, as you've noticed.
>
> yes and that's what I proposed to do:
> Tweaking this FS patch to do mount_single() and define directory
> structure is the best way forward.
>
>> So, if you distribute the names through the kernel and dictate a strict
>> hierarchy, then we'll end up with a similar model that cdevs resolve.
>
> yes. exactly.
> but compared to cdev, it will be:
> - cheaper for kernel to keep (memory wise)
> - faster to pin FDs
> - do normal 'rm' to destroy
> - possible to extend to unprivileged users
> - possible to add fdinfo (same output for pinned and normal fd)
> - possible to expose key/value
>
> I'm puzzled how you can keep arguing in favor of cdev when it's
> obviously deficient compared to fs and fs has no disadvantages.
> Looks like we can only resolve it over beer.
> How about we set up a public hangout? Today or tomorrow?

Just FYI: Using a device for this kind of interface is pretty
much a non-starter as that quickly gets you into situations where
things do not work in containers. If someone gets a version of device
namespaces past GregKH it might be up for discussion to use character
devices.

But really device nodes are a technology that is slowly being changed to
support hotplug. Nothing you are doing seems to match up well with
devices. So for an interface that you want ordinary applications to use,
character devices are a bad, bad fit.

Eric
