[PATCH RFC 0/4] bpf: cgroup device guard for non-initial user namespace

From: Michael Weiß
Date: Mon Aug 14 2023 - 10:27:38 EST


Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD, which
marks a cgroup device program as a device guard. Such a guard may be
used to control actions on device nodes, e.g., mknod, in a non-initial
userns.

If a container manager restricts its unprivileged (user namespaced)
children with a device cgroup, it is no longer necessary to deny
mknod. User space applications may then map devices to different
locations in the file system by calling mknod() inside the container.

One use case, which we also rely on in GyroidOS, is running virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which currently fails
in a non-initial userns even if a cgroup device whitelist entry with
the major and minor numbers of /dev/null exists. In this case the
usual bind mounts or pre-populated device nodes under /dev are not
sufficient.

To lift this limitation, we allow mknod() in the VFS if a BPF cgroup
device guard is enabled for the current task, and check CAP_MKNOD
against the current user namespace instead of the init userns.
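
The resulting decision can be modelled in plain user space C as
follows. This is a simplified sketch of the rule described above, not
the code from the patch: the struct, its field names and the function
name are hypothetical, and the per-device decision of the cgroup
device program itself (major, minor, access type) is deliberately
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Inputs to the decision; all names here are hypothetical. */
struct mknod_ctx {
	bool is_device_node;       /* char or block device requested */
	bool cap_mknod_init_ns;    /* CAP_MKNOD in the initial userns */
	bool cap_mknod_current_ns; /* CAP_MKNOD in the task's own userns */
	bool devguard_enabled;     /* a cgroup device guard covers the task */
};

/* Model of the capability check only, as described in the text. */
static bool mknod_cap_ok(const struct mknod_ctx *c)
{
	/* Other node types are not subject to this capability check. */
	if (!c->is_device_node)
		return true;

	/* Guarded task: CAP_MKNOD in the current userns is sufficient. */
	if (c->devguard_enabled)
		return c->cap_mknod_current_ns;

	/* Unguarded task: CAP_MKNOD in the init userns is required. */
	return c->cap_mknod_init_ns;
}

/* A few cases that exercise the model. */
static void mknod_cap_examples(void)
{
	struct mknod_ctx container_root = {
		.is_device_node = true,
		.cap_mknod_init_ns = false,
		.cap_mknod_current_ns = true,
		.devguard_enabled = false,
	};

	/* Without a guard, container root may not create device nodes. */
	assert(!mknod_cap_ok(&container_root));

	/* With a guard enabled, the same task passes the check. */
	container_root.devguard_enabled = true;
	assert(mknod_cap_ok(&container_root));
}
```

In this model the guard only relaxes which user namespace CAP_MKNOD is
checked against; which devices may actually be created is still left
to the cgroup device program.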

To avoid unusable device nodes on file systems mounted in a
non-initial user namespace, may_open_dev() ignores SB_I_NODEV for
tasks covered by a cgroup device guard.
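
The effect on opening device nodes can be sketched in the same way.
Again this is a user space model, not the patched may_open_dev(): the
MODEL_* constants are placeholder bit values standing in for
MNT_NODEV and SB_I_NODEV, the function name is hypothetical, and the
model assumes that the per-mount MNT_NODEV flag stays honoured for
guarded tasks.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits for this model only, not the kernel's values. */
#define MODEL_MNT_NODEV   (1u << 0) /* stands in for MNT_NODEV */
#define MODEL_SB_I_NODEV  (1u << 1) /* stands in for SB_I_NODEV */

/*
 * Model: may a device node on this mount be opened?
 * 'guarded' is true if a cgroup device guard covers the task.
 */
static bool may_open_dev_model(unsigned int mnt_flags,
			       unsigned int sb_iflags, bool guarded)
{
	/* Assumption: an explicit nodev mount is always honoured. */
	if (mnt_flags & MODEL_MNT_NODEV)
		return false;

	/* SB_I_NODEV blocks the open only for unguarded tasks. */
	if ((sb_iflags & MODEL_SB_I_NODEV) && !guarded)
		return false;

	return true;
}

/* A few cases that exercise the model. */
static void may_open_dev_examples(void)
{
	/* Superblock marked nodev: denied unless the task is guarded. */
	assert(!may_open_dev_model(0, MODEL_SB_I_NODEV, false));
	assert(may_open_dev_model(0, MODEL_SB_I_NODEV, true));

	/* nodev mount: denied in both cases under this model. */
	assert(!may_open_dev_model(MODEL_MNT_NODEV, 0, true));
}
```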

Tested with a GyroidOS container created by the cmld using the
following user space patch: https://github.com/gyroidos/cml/pull/394

I discussed this internally with Christian in the UAPI group earlier.
I am now posting it to the public list because the LXC/LXD folks have
also expressed interest in it.

This series applies to the latest mainline v6.5-rc6 tag.

Signed-off-by: Michael Weiß <michael.weiss@xxxxxxxxxxxxxxxxxxx>
---
Michael Weiß (4):
      bpf: add cgroup device guard to flag a cgroup device prog
      bpf: provide cgroup_device_guard in bpf_prog_info to user space
      device_cgroup: wrapper for bpf cgroup device guard
      fs: allow mknod in non-initial userns using cgroup device guard

 fs/namei.c                     | 19 ++++++++++++++++---
 include/linux/bpf-cgroup.h     |  7 +++++++
 include/linux/bpf.h            |  1 +
 include/linux/device_cgroup.h  |  7 +++++++
 include/uapi/linux/bpf.h       |  8 +++++++-
 kernel/bpf/cgroup.c            | 30 ++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |  6 +++++-
 security/device_cgroup.c       | 10 ++++++++++
 tools/bpf/bpftool/prog.c       |  2 ++
 tools/include/uapi/linux/bpf.h |  8 +++++++-
 10 files changed, 92 insertions(+), 6 deletions(-)
---
base-commit: 2ccdd1b13c591d306f0401d98dedc4bdcd02b421
change-id: 20230814-devcg_guard-5398ef84bf7b

Best regards,
--
Michael Weiß <michael.weiss@xxxxxxxxxxxxxxxxxxx>