Re: Potential issues (security and otherwise) with the current cgroup-bpf API

From: Alexei Starovoitov
Date: Mon Dec 19 2016 - 19:03:14 EST


On Mon, Dec 19, 2016 at 01:23:50PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 19, 2016 at 12:56 PM, Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> > On Sat, Dec 17, 2016 at 10:18:44AM -0800, Andy Lutomirski wrote:
> >> Hi all-
> >>
> >> I apologize for being rather late with this. I didn't realize that
> >> cgroup-bpf was going to be submitted for Linux 4.10, and I didn't see
> >> it on the linux-api list, so I missed the discussion.
> >>
> >> I think that the inet ingress, egress etc filters are a neat feature,
> >> but I think the API has some issues that will bite us down the road
> >> if it becomes stable in its current form.
> >>
> >> Most of the problems I see are summarized in this transcript:
> >>
> >> # mkdir cg2
> >> # mount -t cgroup2 none cg2
> >> # mkdir cg2/nosockets
> >> # strace cgrp_socket_rule cg2/nosockets/ 0
> >> ...
> >> open("cg2/nosockets/", O_RDONLY|O_DIRECTORY) = 3
> >>
> >> ^^^^ You can modify a cgroup after opening it O_RDONLY?
> >>
> >> bpf(BPF_PROG_LOAD, {prog_type=0x9 /* BPF_PROG_TYPE_??? */, insn_cnt=2,
> >> insns=0x7fffe3568c10, license="GPL", log_level=1, log_size=262144,
> >> log_buf=0x6020c0, kern_version=0}, 48) = 4
> >>
> >> ^^^^ This is fine. The bpf() syscall manipulates bpf objects.
> >>
> >> bpf(0x8 /* BPF_??? */, 0x7fffe3568bf0, 48) = 0
> >>
> >> ^^^^ This is not so good:
> >> ^^^^
> >> ^^^^ a) The bpf() syscall is supposed to manipulate bpf objects. This
> >> ^^^^ is manipulating a cgroup. There's no reason that a socket creation
> >> ^^^^ filter couldn't be written in a different language (new iptables
> >> ^^^^ table? Simple list of address families?), but if that happened,
> >> ^^^^ then using bpf() to install it would be entirely nonsensical.
> >
> > I don't see why it's _modifing_ the cgroup. I'm looking at it as
> > network stack is using cgroup as an application group that should
> > invoke bpf program at the certain point in the stack.
> > imo cgroup management is orthogonal.
>
> It is literally modifying the struct cgroup, and, as a practical
> matter, it's causing membership in the cgroup to have a certain
> effect. But rather than pointless arguing, let me propose an
> alternative API that I think solves most of the problems here.
>
> In my model, BPF_PROG_ATTACH and BPF_PROG_DETACH go away completely.
> Instead, the cgroup gets three new control files:
> "net.ingress_filter", "net.egress_filter", and
> "net.socket_create_filter". Initially, if you read these files, you
> see "none\n".
>
> To attach a bpf filter, you open the file for write and do an ioctl on
> it. After doing the ioctl, if you read the file, you'll see
> "bpf:[hash]\n" where "[hash]" is exactly what you'd see in fdinfo for
> the bpf program.
>
> To detach any type of filter, bpf or otherwise, you open the file for
> write and write "none\n" (or just "none").
>
> If you write anything else to the file, you get -EINVAL. But, if
> someone writes a new type of filter (perhaps a simple list of address
> families), maybe you can enable the filter by writing something
> appropriate to the file.

I see no difference in what you're proposing vs what is already implemented
from feature set point of view, but the file approach is very ugly, since
it's a mismatch to FD style access that bpf is using everywhere.
In your proposal you'd also need to add bpf prefix everywhere.
So the control file names should be bpf_inet_ingress, bpf_inet_egress
and bpf_socket_create.
If you want to prepare such patch for them, I don't mind,
but you cannot kill syscall command, since it's more flexible
and your control-file approach _will_ be obsolete pretty quickly.

> Now the API matches the effect. A cgroup with something other than
> "none" in one of its net.*_filter files is a cgroup that filters
> network activity. And you get CRIU compatibility for free: CRIU can
> read the control file and use whatever mechanism is uses for BPF in
> general to restore the cgroup filter state. As an added bonus, you
> get ACLs for free and the ugly multiplexer goes away.

extended bpf is not supported by criu. only classic, so having
control_file-style attachment doesn't buy us anything.

> >> # mkdir cg2/nosockets/sockets
> >> # /home/luto/apps/linux/samples/bpf/cgrp_socket_rule cg2/nosockets/sockets/ 1
> >>
> >> ^^^^ This succeeded, which means that, if this feature is enabled in 4.10,
> >> ^^^^ then we're stuck with its semantics. If it returned -EINVAL instead,
> >> ^^^^ there would be a chance to refine it.
> >
> > i don't see the problem with this behavior. bpf and cgroup are indepedent.
> > Imange that socket accounting program is attached to cg2/nosockets.
> > The program is readonly and carry no security meaning.
> > Why cgroup should prevent creation of cg2/nosockets/foo directory ?
>
> I think you're misunderstanding me. What I'm saying is that, if you
> allow a cgroup and one of its descendents to both enable the same type
> of filter, you have just committed to some particular semantics for
> what happens. And I think that the current semantics are the *wrong*
> semantics for default long-term use, so you should either fix the
> semantics or disable the problematic case.

Are you saying that use cases I provided are also 'wrong'?
If you insist on that we won't be able to make any forward progress.
The current semantics is fine for what it's designed for.

> >> # echo $$ >cg2/nosockets/sockets/cgroup.procs
> >> # ping 127.0.0.1
> >> PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
> >> 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.029 ms
> >
> > hmm. in this case the admin created a subgroup with a group that allows
> > ping, so? how's that a problem?
>
> I think you're forgetting about namespaces. There are two different
> admins here. There's the admin who created the outer container and
> said "no sockets here". Then there's the admin inside the container
> who said "muahaha, I can create a sub-container and allow sockets
> *there*".

container management frameworks should be doing sensible
things. With any infra in the kernel there are many ways to
create non-sensical setups. It's not a job of the kernel
to interfere with user space.

> > In case of vrf I can imagine
> > a set of application auto-binding to one vrf and smaller
> > set of applications binding to a different vrf.
> > Or broader set of applications monitoring all tcp traffic
> > and subset of them monitoring udp instead.
> > Those are valid use cases.
>
> > I guess you're advocating to run a link list of programs?
> > That won't be useful for the above use cases, where there is no
> > reason to run more than one program that control plane
> > assigned to run for this cgroup.
>
> Yes there is, for both monitoring and policy. If I want to monitor
> all activity in a cgroup, I probably want to monitor descendents as
> well. If I want to restrict a cgroup, I want to restrict its
> descendents.

you're ignoring use cases I described earlier.
In vrf case there is only one ifindex it needs to bind to.

> In the case where I actually don't want to hook the descendents, I'd
> be find with having an option to turn off recursion. Maybe
> net.egress_filter could also say "bpf(overridable):[hash]" or
> "bpf(nonrecursive):[hash]". But you should have to opt in to allowing
> your filter to be overridden.

'opt in' only makes sense if you're implementing sandboxing environment.
It doesn't make sense as the default.

> > At this stage we decided to allow only one program per cgroup per hook
> > and later can extend it if necessary.
>
> No you can't. Since you allow the problematic case and you gave it
> the surprising "one program" semantics, you can't change it later.

please show me why we cannot? As far as I can see nothing prevents
that in the future. We can add any number of new fields to
BPF_PROG_ATTACH command just like we did in the past with
other commands, whereas control file interface is not extensible.

> > As you're pointing out, in case of security, we probably
> > want to preserve original bpf program that should always be
> > run first and only after it returned 'ok' (we'd need to define
> > what 'ok' means in secruity context) run program attached to sub-hierarchy.
>
> It's already defined AFAICT. 1 means okay. 0 means not okay.

sorry that doesn't make any sense. For seccomp we have a set of
ranges that mean different things. Here you're proposing to
hastily assign 1 and 0 ? How is that extensible?
We need to carefully think through what should be the semantics
of attaching multiple programs, consider performance implications,
return codes and so on.

> > Another alternative is to disallow attaching programs in sub-hierarchy
> > if parent has something already attached, but it's not useful
> > for general case.
> > All of these are possible future extensions.
>
> I like this last one, but IT'S NOT A POSSIBLE FUTURE EXTENSION. You
> have to do it now (or disable the feature for 4.10). This is why I'm
> bringing this whole thing up now.

We don't have to touch user visible api here, so extensions are fine.

> >> In 4.10 with With CONFIG_CGROUP_BPF=y, a cgroup can have bpf
> >> programs attached that can do things if various events occur. (Right
> >> now, this means socket operations, but there are plans in the works
> >> to do this for LSM hooks too.) These bpf programs can say yes or no,
> >> but they can also read out various data (including socket payloads!)
> >> and save them away where an attacker can find them. This sounds a
> >> lot like seccomp with a narrower scope but a much stronger ability to
> >> exfiltrate private information.
> >
> > that sounds like a future problem to solve when bpf+lsm+cgroup are
> > used for security.
>
> [...]
>
> >
> > I disagree with the urgency. I see nothing that needs immediate action.
> > bpf+lsm+cgroup is not in the tree yet.
> >
>
> I disagree very strongly here. The API in 4.10 is IMO quite ugly, but
> the result if bpf+lsm+cgroup works *differently* will be far, far
> uglier.

again we're talking about the future here and 'ugly' is the matter of taste.
I hear all the time that people say that netlink api is ugly, so?

> I think the right solution here is to clean up the API so that it'll
> work for future extensions that people are already imagining. If this
> can happen for 4.10, great. If not, then postpone this stuff
> entirely. I've had code I've written for Linux postponed extensively
> until I've gotten the API right, and it's not so bad. (Actually, I've
> even had API changes I've made reverted in -stable, IIRC. This is
> much worse than postponing before a release.)

huh? 'not right api' because it's using bpf syscall instead
of cgroup control-file? I think the opposite is the truth.
syscall style is more extensible whereas control-file and text
based interfaces have proven to be huge pain to extend and maintain.
Again, I don't mind if you want to have both available.