Re: [PATCH v5 04/11] blksnap: header file of the module interface

From: Dave Chinner
Date: Tue Jun 13 2023 - 18:25:30 EST


On Mon, Jun 12, 2023 at 03:52:21PM +0200, Sergei Shtepa wrote:
> The header file contains a set of declarations, structures and control
> requests (ioctl) that allows to manage the module from the user space.
>
> Co-developed-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Signed-off-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Tested-by: Donald Buczek <buczek@xxxxxxxxxxxxx>
> Signed-off-by: Sergei Shtepa <sergei.shtepa@xxxxxxxxx>
> ---
> MAINTAINERS | 1 +
> include/uapi/linux/blksnap.h | 421 +++++++++++++++++++++++++++++++++++
> 2 files changed, 422 insertions(+)
> create mode 100644 include/uapi/linux/blksnap.h


.....

> +/**
> + * struct blksnap_snapshot_append_storage - Argument for the
> + * &IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE control.
> + *
> + * @id:
> + * Snapshot ID.
> + * @bdev_path:
> + * Device path string buffer.
> + * @bdev_path_size:
> + * Device path string buffer size.
> + * @count:
> + * Size of @ranges in the number of &struct blksnap_sectors.
> + * @ranges:
> + * Pointer to the array of &struct blksnap_sectors.
> + */
> +struct blksnap_snapshot_append_storage {
> + struct blksnap_uuid id;
> + __u64 bdev_path;
> + __u32 bdev_path_size;
> + __u32 count;
> + __u64 ranges;
> +};
> +
> +/**
> + * define IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE - Append storage to the
> + * difference storage of the snapshot.
> + *
> + * The snapshot difference storage can be set either before or after creating
> + * the snapshot images. This allows to dynamically expand the difference
> + * storage while holding the snapshot.
> + *
> + * Return: 0 if succeeded, negative errno otherwise.
> + */
> +#define IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE \
> + _IOW(BLKSNAP, blksnap_ioctl_snapshot_append_storage, \
> + struct blksnap_snapshot_append_storage)

That's an API I'm extremely uncomfortable with. We've learnt the
lesson *many times* that userspace physical mappings of underlying
file storage are unreliable.

i.e. This is reliant on userspace telling the kernel the physical
mapping of the filesystem file to block device LBA space and then
providing a guarantee (somehow) that the mapping will always remain
unchanged. i.e. It's reliant on passing FIEMAP data from the
filesystem to userspace and then back into the kernel without it
becoming stale and somehow providing a guarantee that nothing (not
even the filesystem doing internal garbage collection) will change
it.

It is reliant on userspace detecting shared blocks in files and
avoiding them; it's reliant on userspace never being able to read,
write or modify that file; it's reliant on the -filesystem- never
modifying the layout of that file; it's even reliant on a internal
filesystem state that has to be locked down before the block mapping
can be delegated to a third party for IO control.

Further, we can't allow userspace to have any read access to the
snapshot file even after it is no longer in use by the blksnap
driver. The contents of the file will span multiple security
contexts, contain sensitive data, etc and so it's contents must
never be exposed to userspace. We cannot rely on userspace to delete
it safely after use and hence we have to protect it's contents
from exposure to userspace, too.

We already have a mechanism that provides all these guarantees to a
third party kernel subsystem: swap files.

We already have a trusted path in the kernel to allow internal block
mapping of a swap file to be retreived by the mm subsystem. We also
have an inode flag that protects it such files against access and
modification from anything other than internal kernel IO paths. We
also allow them to be allocated as unwritten extents using
fallocate() and we are never converted to written whilist in use as
a swapfile. Hence the contents of them cannot be exposed to
userspace even if the swapfile flag is removed and owner/permission
changes are made to the file after it is released by the kernel.

Swap files are an intrinsically safe mechanism for delegating fixed
file mappings to kernel subsystems that have requirements for
secure, trusted storage that userspace cannot tamper with.

I note that the code behind the
IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE ends up in
diff_storage_add_range(), which allocates an extent structure for
each range and links it into a linked list for later use.

This is effectively the same structure that the mm swapfile code
uses. It provides a swap_info_struct and a struct file to the
filesystem via aops->swap_activate. The filesystem then iterates the
extent list for the file and calls add_swap_extent() for each
physical range in the file. The mm code then allocates a new extent
structure for the range and links it into the extent rbtree in the
swap_info_struct. This is the mapping it uses later on in the IO
path.

Adding a similar, more generic mapping operation that allows a
private structure and a callback to the provided would allow the
filesystem to provide this callback directly to subsystems like
blksnap. Essentially diff_storage_add_range() becomes the iterator
callback for blksnap. This makes the whole "userspace provides the
mapping" problem goes away and we can use the swapfile mechanisms to
provide all the other guarantees the kernel needs to ensure it can
trust the contents and mappings of the blksnap snapshot files....

Thoughts?

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx