[RFC PATCH 00/11] Add support for devtmpfs in user namespaces

From: Seth Forshee
Date: Wed May 14 2014 - 17:38:27 EST


Unprivileged containers cannot run mknod, making it difficult to support
devices which appear at runtime. Using devtmpfs is one possible
solution, and it would have the added benefit of making container setup
simpler. But simply letting containers mount devtmpfs isn't sufficient
since the container may need to see a different, more limited set of
devices, and because different environments making modifications to
the filesystem could lead to conflicts.

This series solves these problems by assigning devices to user
namespaces. Each device has an "owner" namespace, which determines which
devtmpfs mount the device appears in and also allows privileged
operations on the device from that namespace. The owner defaults to
init_user_ns. There's also an ns_global flag to indicate that a device
should appear in all devtmpfs mounts.

devtmpfs is updated to present a different superblock to each user
namespace. Each superblock contains nodes only for global devices and
for the devices assigned to the associated namespace.
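
As a rough illustration of the visibility rule described above, here is a
small userspace model. It is not code from the series: the struct layouts
and every model_* name are hypothetical stand-ins, and only the "owner"
namespace, ns_global and the init_user_ns default come from the
description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct user_namespace. */
struct user_ns {
	const char *name;
};

/* Hypothetical stand-in for struct device; field names are illustrative. */
struct model_device {
	const char *name;
	const struct user_ns *owner;	/* NULL means "default owner" */
	bool ns_global;			/* node appears in every mount */
};

/* Models init_user_ns, the default owner of every device. */
const struct user_ns model_init_user_ns = { "init_user_ns" };

/* Should dev get a node in the devtmpfs superblock presented to ns? */
bool model_device_visible(const struct model_device *dev,
			  const struct user_ns *ns)
{
	const struct user_ns *owner;

	assert(dev != NULL && ns != NULL);
	owner = dev->owner ? dev->owner : &model_init_user_ns;
	return dev->ns_global || owner == ns;
}

/*
 * Collect the node names that the superblock for ns would contain.
 * Returns the number of entries written to nodes[].
 */
size_t model_populate_sb(const struct model_device *devs, size_t ndevs,
			 const struct user_ns *ns,
			 const char **nodes, size_t max_nodes)
{
	size_t i, n = 0;

	for (i = 0; i < ndevs && n < max_nodes; i++)
		if (model_device_visible(&devs[i], ns))
			nodes[n++] = devs[i].name;
	return n;
}
```

With a global "null", a default-owned "sda" and a "loop0" owned by a
container namespace, the container's superblock in this model would list
null and loop0, while the initial namespace would see null and sda.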

The implementation isn't complete at this point - it's lacking proper
cleanup when a namespace is no longer in use, and only a sampling of
devices has been updated to support use in namespaces. I'm sending the
patches now for feedback on the overall approach and the implementation
so far. I also have a couple of areas where I'd appreciate some
suggestions:

 * If devices are owned by a namespace it might be useful to have this
   awareness for uevents and sysfs as well. Would it make sense to apply
   the ownership to kobjects rather than devices?

 * I'd like to be able to do cleanup when a namespace is destroyed,
   e.g. with loop devices I'd probably free up any devices owned by the
   namespace. But that's impossible in the current implementation since
   the device holds a reference to the namespace. Any suggestions to get
   around this? I haven't spent much time thinking about it yet, but my
   first thought was to add some kind of weak reference to user
   namespaces. Then when the main reference count hits zero the
   namespace isn't destroyed, but there would be a notification that
   drivers could use to perform cleanup. Once all weak references were
   released the memory would actually be freed.
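
To make the weak reference idea a bit more concrete, here is a
single-threaded userspace sketch. It is not part of the series and not an
existing kernel API: every model_* name is hypothetical, plain counters
stand in for atomics, and a "freed" flag stands in for actually releasing
memory. It assumes the strong references collectively hold one implicit
weak reference, so the object survives until the notification has run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ns;
typedef void (*model_ns_dead_fn)(struct model_ns *ns, void *data);

/* Hypothetical model of a namespace with strong and weak counts. */
struct model_ns {
	unsigned int strong;	/* ordinary ("main") references */
	unsigned int weak;	/* weak refs, +1 held by the strong refs */
	bool dead;		/* strong count has reached zero */
	bool freed;		/* stands in for freeing the memory */
	model_ns_dead_fn notify;	/* driver cleanup hook, may be NULL */
	void *notify_data;
};

void model_ns_init(struct model_ns *ns, model_ns_dead_fn notify, void *data)
{
	ns->strong = 1;
	ns->weak = 1;		/* implicit weak ref for all strong refs */
	ns->dead = false;
	ns->freed = false;
	ns->notify = notify;
	ns->notify_data = data;
}

void model_ns_get_weak(struct model_ns *ns)
{
	assert(!ns->freed);
	ns->weak++;
}

void model_ns_put_weak(struct model_ns *ns)
{
	assert(ns->weak > 0);
	if (--ns->weak == 0)
		ns->freed = true;	/* real code would free memory here */
}

/* Drop a strong reference; the last one triggers the notification. */
void model_ns_put(struct model_ns *ns)
{
	assert(ns->strong > 0);
	if (--ns->strong > 0)
		return;
	ns->dead = true;
	if (ns->notify)
		ns->notify(ns, ns->notify_data);
	model_ns_put_weak(ns);	/* release the implicit weak reference */
}

/* Hypothetical loop-like device that only holds a weak reference. */
struct model_loop_dev {
	struct model_ns *owner;
	bool in_use;
};

/* Cleanup hook: tear the device down and release its weak reference. */
void model_loop_ns_dead(struct model_ns *ns, void *data)
{
	struct model_loop_dev *lo = data;

	if (lo->owner == ns) {
		lo->in_use = false;
		lo->owner = NULL;
		model_ns_put_weak(ns);
	}
}
```

In this model a device's weak reference no longer keeps the strong count
above zero, so dropping the last strong reference runs the cleanup hook,
and the "free" happens only after the last weak reference is released.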

Thanks,
Seth


Seth Forshee (11):
  driver core: Assign owning user namespace to devices
  driver core: Add device_create_global()
  tmpfs: Add sub-filesystem data pointer to shmem_sb_info
  ramfs: Add sub-filesystem data pointer to ram_fs_info
  devtmpfs: Add support for mounting in user namespaces
  drivers/char/mem.c: Make null/zero/full/random/urandom available to
    user namespaces
  block: Make partitions inherit namespace from whole disk device
  block: Allow blkdev ioctls within user namespaces
  misc: Make loop-control available to all user namespaces
  loop: Assign devices to current_user_ns()
  loop: Allow priveleged operations for root in the namespace which owns
    a device

 block/compat_ioctl.c       |   3 +-
 block/ioctl.c              |  16 +-
 block/partition-generic.c  |   2 +
 drivers/base/core.c        |  54 ++++-
 drivers/base/devtmpfs.c    | 509 ++++++++++++++++++++++++++++++++-------------
 drivers/block/loop.c       |  22 +-
 drivers/char/mem.c         |  28 ++-
 drivers/char/misc.c        |  11 +-
 fs/ramfs/inode.c           |   8 -
 include/linux/device.h     |  18 ++
 include/linux/miscdevice.h |   1 +
 include/linux/ramfs.h      |   9 +
 include/linux/shmem_fs.h   |   1 +
 13 files changed, 499 insertions(+), 183 deletions(-)
