Re: [RESEND RFC PATCH v2 00/14] device_cgroup: guard mknod for non-initial user namespace

From: Michael Weiß
Date: Tue Nov 28 2023 - 15:54:56 EST


On 24.11.23 19:08, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 05:47:32PM +0100, Christian Brauner wrote:
>>> - Integrate this as LSM (Christian, Paul)
>>
>> Huh, my rant made you write an LSM. I'm not sure if that's a good or bad
>> thing...
>>
>> So I dislike this less than the initial version that just worked around
>> SB_I_NODEV and this might be able to go somewhere. _But_ I want to see
the rules written down:

Since we have some new Ideas, I also will try to better describe
the vision and current device node interaction when submitting the next
version.

>
> Hm, I wonder if we're being to timid or too complex in how we want to
> solve this problem.
>
> The device cgroup management logic is hacked into multiple layers and is
> frankly pretty appalling.
>
> What I think device access management wants to look like is that you can
> implement a policy in an LSM - be it bpf or regular selinux - and have
> this guarded by the main hooks:
>
> security_file_open()
> security_inode_mknod()
>
> So, look at:
>
> vfs_get_tree()
> -> security_sb_set_mnt_opts()
> -> bpf_sb_set_mnt_opts()
>
> A bpf LSM program should be able to strip SB_I_NODEV from sb->s_iflags
> today via bpf_sb_set_mnt_opts() without any kernel changes at all.
> > I assume that a bpf LSM can also keep state in sb->s_security just like
> selinux et al? If so then a device access management program or whatever
> can be stored in sb->s_security.
>
> That device access management program would then be run on each call to:
>
> security_file_open()
> -> bpf_file_open()
>
> and
>
> security_inode_mknod()
> -> bpf_sb_set_mnt_opts()
>
> and take access security_sb_set_mnt_optsdecisions.
>
> This obviously makes device access management something that's tied
> completely to a filesystem. So, you could have the same device node on
> two tmpfs filesystems both mounted in the same userns.
>
> The first tmpfs has SB_I_NODEV and doesn't allow you to open that
> device. The second tmpfs has a bpf LSM program attached to it that has
> stripped SB_I_NODEV and manages device access and allows callers to open
> that device.

I like the approach to clear SB_I_NODEV in security_sb_set_mnt_opts().
I still have to sort this out but I think that was the missing piece in
the current patch set.

>
> I guess it's even possible to restrict this on a caller basis by marking
> them with a "container id" when the container is started. That can be
> done with that task storage thing also via a bpf LSM hook. And then
> you can further restrict device access to only those tasks that have a
> specific container id in some range or some token or something.
>
> I might just be fantasizing abilities into bpf that it doesn't have so
> anyone with the knowledge please speak up.
>
> If this is feasible then the only thing we need to figure out is what to
> do with the legacy cgroup access management and specifically the
> capable(CAP_SYS_ADMIN) check that's more of a hack than anything else.

First to make this clear, we speak about CAP_SYS_MKNOD.

My approach was to restructure the cgroup_device in an own cgroup_device
lsm not in the current bpf lsm, to also be able to handle the legacy calls.
Especially, the remaining direct calls to devcgroup_check_permission() are
transformed to a new security_hook security_dev_permission() which is
similar to security_inode_permission() but could be called in such places
where only the dev_t representation is available. However, if we
implement it that way you sketched up above, we can just leave the
devcgroup_check_permission() in its current place and only implement
the devcgroup_inode_permission() and devcgroup_mknode and let the blk
and amd/gpu drivers as is for now, or just leave all the devcgroup_*()
hooks as is.

>
> So, we could introduce a sysctl that makes it possible to turn this
> check into ns_capable(sb->s_userns, CAP_SYS_ADMIN). Because due to
> SB_I_NODEV it is inherently safe to do that. It's just that a lot of
> container runtimes need to have time to adapt to a world where you may
> be able to create a device but not be able to then open it. This isn't
> rocket science but it will take time.

True. I think a sysctl would be a good option.

>
> But in the end this will mean we get away with minimal kernel changes
> and using a lot of existing infrastructure.
>
> Thoughts?

For the whole bpf lsm part I'm not confident enough to make any
proposition, yet.
But I think an own simple devnode lsm would be feasible with the above
described security_sb_set_mnt_opts() handling to get this idea realized.
Maybe we go that way to implement a simple lsm without any changes to
the device_cgroup and keep the devcgroup hooks in place. To implement
it as bpf lsm with all full blown conatiner_id could then be done
orthogonally.
So from a simple container runtime perspective one could just use the
simple lsm and the existing bpf device cgroup program without any change
only having to activate the sysctl. A more complex cloud setup Kubernetes
what so ever, could then use bpf lsm approach.