Re: Race-free block device opening

From: Demi Marie Obenour
Date: Tue Apr 26 2022 - 17:32:01 EST


On Tue, Apr 26, 2022 at 08:35:34PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Apr 26, 2022 at 02:12:22PM -0400, Demi Marie Obenour wrote:
> > Right now, opening block devices in a race-free way is incredibly hard.
> > The only reasonable approach I know of is sd_device_new_from_path() +
> > sd_device_open(), and is only available in systemd git main. It also
> > requires waiting on systemd-udev to have processed udev rules, which can
> > be a bottleneck. There are better approaches in various special cases,
> > such as using device-mapper ioctls to check that the device one has
> > opened still has the name and/or UUID one expects. However, none of
> > them works for a plain call to open(2).
>
> Why do you call open(2) on a block device?

There are many reasons to do so:

- Some programs invoke ioctls on the block device FD.
- Some programs perform I/O using a block device (or a partition)
directly. mkfs, fsck, dd, lvm, cryptsetup, and Ceph all fall in this
category.
- Some programs need to use the block device’s major and minor numbers
in device-mapper ioctls, and need to make sure that the major and
minor number won’t be recycled behind their back.
- Some programs need to assign the device to a virtual machine.

> > A much better approach would be for udev to point its symlinks at
> > "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> > "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.
>
> You can do that today with udev rules, right?

One can make udev create a symlink with that path pointing to the kernel
device name, but not make udev’s other symlinks point to that path. It
is also still necessary to check (with BLKGETDISKSEQ) that the device
one opened is what one intended to open.
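
For concreteness, that check might look roughly like the Python sketch
below. It assumes the asm-generic ioctl encoding (x86, arm64 and
similar), under which BLKGETDISKSEQ, defined in <linux/fs.h> as
_IOR(0x12, 128, __u64), works out to 0x80081280. get_diskseq() and
open_checked() are hypothetical helper names, not an existing API, and
the caller is assumed to have learned the expected diskseq from some
trusted source beforehand.

```python
import fcntl
import os
import struct

# BLKGETDISKSEQ is _IOR(0x12, 128, __u64) in <linux/fs.h>.
# Assumption: asm-generic ioctl encoding (direction bits at bit 30,
# size at bit 16). Some architectures encode ioctl numbers differently.
_IOC_READ = 2
BLKGETDISKSEQ = (_IOC_READ << 30) | (struct.calcsize("Q") << 16) | (0x12 << 8) | 128


def get_diskseq(fd):
    """Return the disk sequence number of the block device open on fd."""
    buf = bytearray(8)                    # the kernel writes a __u64 here
    fcntl.ioctl(fd, BLKGETDISKSEQ, buf)   # raises OSError if unsupported
    return struct.unpack("Q", buf)[0]


def open_checked(path, expected_diskseq, flags=os.O_RDONLY):
    """Open path, then verify that it is the device we meant to open.

    Hypothetical helper: returns the file descriptor on success, raises
    OSError (or its subclass FileNotFoundError) otherwise.
    """
    fd = os.open(path, flags | os.O_CLOEXEC)
    try:
        actual = get_diskseq(fd)
        if actual != expected_diskseq:
            raise FileNotFoundError(
                f"{path}: diskseq is {actual}, expected {expected_diskseq}"
            )
    except BaseException:
        os.close(fd)                      # never leak the wrong device
        raise
    return fd
```

The check happens on the already-open file descriptor, so a later reuse
of the name cannot invalidate it. Every caller still has to remember to
do this, which is the part I would like to avoid.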

> > A
> > filesystem would then be mounted at "/dev/disk/by-diskseq" that provides
> > for race-free opening of these paths.
>
> How would it be any less race-free than just open("/dev/sda1") is?

Assuming you meant "more race-free", the answer is that /dev/sda1 is not
guaranteed to always point to the same device. For example, the user
might unplug their USB hard drive and plug in a new one, which can then
be assigned the same name. The problem is much more severe for virtual
devices, such as /dev/loop* or /dev/dm-*, which can be created and
destroyed quite frequently. If a diskseqfs is implemented and mounted
on /dev/disk/by-diskseq, opening /dev/disk/by-diskseq/1 will either
return the same device every time, or return an error if the original
device no longer exists.
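
To illustrate the difference, here is a toy in-memory model in Python.
It is not kernel code and does not touch any real device; the class and
method names are made up for this example. The only property it models
is that device names get recycled while sequence numbers never do.

```python
import itertools


class ToyBlockLayer:
    """Toy model: names are recycled, sequence numbers never are."""

    def __init__(self):
        self._next_seq = itertools.count(1)
        self.by_name = {}       # e.g. "loop0" -> diskseq
        self.by_diskseq = {}    # diskseq -> backing device description

    def attach(self, name, backing):
        """Create a device under a (possibly reused) name."""
        seq = next(self._next_seq)
        self.by_name[name] = seq
        self.by_diskseq[seq] = backing
        return seq

    def detach(self, name):
        """Destroy the device; its diskseq is retired for good."""
        seq = self.by_name.pop(name)
        del self.by_diskseq[seq]

    def open_by_name(self, name):
        """Like open("/dev/loop0"): returns whatever has the name now."""
        return self.by_diskseq[self.by_name[name]]

    def open_by_diskseq(self, seq):
        """Like the proposed diskseqfs: same device or an error."""
        if seq not in self.by_diskseq:
            raise FileNotFoundError(seq)
        return self.by_diskseq[seq]


layer = ToyBlockLayer()
old = layer.attach("loop0", "a.img")
layer.detach("loop0")
new = layer.attach("loop0", "b.img")    # the name "loop0" is reused

print(layer.open_by_name("loop0"))      # silently a different device
try:
    layer.open_by_diskseq(old)
except FileNotFoundError:
    print("diskseq", old, "is gone")    # the stale reference fails loudly
```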

> > This could be implemented in
> > userspace using FUSE, either with difficulty using the current kernel
> > API, or easily and efficiently using a new kernel API for opening a
> > block device by diskseq + partition. However, I think this should be
> > handled by the Linux kernel itself.
> >
> > What would be necessary to get this into the kernel?
>
> Get what exactly? I don't see anything the kernel needs to do here
> specifically. Normally block devices are accessed using mount(2), not
> open(2). Do you want a new mount(2)-type api?

I would like to have a filesystem, which will typically be mounted on
/dev/disk/by-diskseq, such that:

- Opening /dev/disk/by-diskseq/$DISKSEQ always returns a device with
sequence number $DISKSEQ or an error.
- Opening /dev/disk/by-diskseq/${DISKSEQ}p${PARTITION} always returns
partition $PARTITION of the device with diskseq $DISKSEQ or an error.
- If a device with diskseq $DISKSEQ exists, opening
/dev/disk/by-diskseq/$DISKSEQ will return a file descriptor to the
device, provided the user has sufficient permissions and no errors
happen.
- If a device with diskseq $DISKSEQ exists and has a partition
$PARTITION, opening /dev/disk/by-diskseq/${DISKSEQ}p${PARTITION} will
return a file descriptor to partition $PARTITION of the device
$DISKSEQ, provided the user has sufficient permissions and no errors
happen.
- Listing /dev/disk/by-diskseq will enumerate all path names for which
an open could succeed.

Obviously /dev/disk/by-diskseq can be replaced with any other path at
which diskseqfs is mounted, but I expect diskseqfs to typically be
mounted at that path.
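
As a rough sketch of the naming scheme in the list above, the entries
could be parsed and generated as follows (Python, purely illustrative).
parse_entry() and format_entry() are hypothetical names, and rejecting
leading zeros is my own assumption, made so that each device has exactly
one valid name.

```python
import re

# "$DISKSEQ" or "${DISKSEQ}p${PARTITION}", both positive decimal numbers.
# Assumption: no leading zeros, so names are canonical.
_ENTRY_RE = re.compile(r"([1-9][0-9]*)(?:p([1-9][0-9]*))?")


def parse_entry(name):
    """Split a diskseqfs entry name into (diskseq, partition or None)."""
    m = _ENTRY_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a diskseqfs entry name: {name!r}")
    diskseq = int(m.group(1))
    partition = int(m.group(2)) if m.group(2) is not None else None
    return diskseq, partition


def format_entry(diskseq, partition=None):
    """Inverse of parse_entry()."""
    if partition is None:
        return str(diskseq)
    return f"{diskseq}p{partition}"


print(parse_entry("7"))      # whole disk with diskseq 7
print(parse_entry("7p2"))    # partition 2 of the disk with diskseq 7
```

A lookup of any other name in the directory would simply fail, the same
way a lookup of a diskseq that no longer exists would.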

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
