Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

From: David Howells
Date: Tue Aug 11 2020 - 20:05:46 EST


Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> [ I missed the beginning of this discussion, so maybe this was already
> suggested ]

Well, the start of it was my proposal of an fsinfo() system call. At its
simplest, that takes an object reference (eg. a path) and an integer attribute
ID and returns the value of the attribute. (It could use a string ID instead,
I suppose, but that would mean a bunch of strcmps instead of integer
comparisons.) But I allow you to do slightly more interesting things than that
too.

Miklós seems dead-set against adding a system call specifically for this -
though he's proposed extending open in various ways and also proposed an
additional syscall, readfile(), that does the open+read+close all in one step.

I think also at some point, he (or maybe James?) proposed adding a new magic
filesystem mounted somewhere on proc (reflecting an open fd) that then had a
bunch of symlinks to somewhere in sysfs (reflecting a mount). The idea being
that you did something like:

	fd = open("/path/to/object", O_PATH);
	sprintf(name, "/proc/self/fds/%u/attr1", fd);
	attrfd = open(name, O_RDONLY);
	read(attrfd, buf1, sizeof(buf1));
	close(attrfd);
	sprintf(name, "/proc/self/fds/%u/attr2", fd);
	attrfd = open(name, O_RDONLY);
	read(attrfd, buf2, sizeof(buf2));
	close(attrfd);

or:

	sprintf(name, "/proc/self/fds/%u/attr1", fd);
	readfile(name, buf1, sizeof(buf1));
	sprintf(name, "/proc/self/fds/%u/attr2", fd);
	readfile(name, buf2, sizeof(buf2));

and "/proc/self/fds/12/attr2" might then be a symlink to, say,
"/sys/mounts/615/mount_attr".

Miklós's justification for this was that it could then be operated from a shell
script without the need for a utility - except that bash, at least, can't do
O_PATH opens.

James has proposed making fsconfig() able to retrieve attributes (though I'd
prefer to give it a sibling syscall that does the retrieval rather than making
fsconfig() do that too).

> {
> int fd, attrfd;
>
> fd = open(path, O_PATH);
> attrfd = openat(fd, name, O_ALT);
> close(fd);
> read(attrfd, value, size);
> close(attrfd);
> }

Please don't go down this path. You're proposing five syscalls - including
creating two file descriptors - to do what fsinfo() does in one.

Do you have a particular objection to adding a syscall specifically for
retrieving filesystem/VFS information?

-~-

Anyway, in case you're interested in what I want to get out of this - which is
the reason for it being posted in the first place:

(*) The ability to retrieve various attributes of a filesystem/superblock,
including information on:

- Filesystem features: Does it support things like hard links, user
quotas, direct I/O?

- Filesystem limits: What's the maximum size of a file, an xattr, a
directory? How many files can it support?

- Supported API features: Which FS_IOC_GETFLAGS flags does it support?
Which can be set? Does it have Windows file attributes available? What
statx attributes are supported? What do the timestamps support?
What sort of case handling is done on filenames?

Note that for a lot of cases, this stuff is fixed and can just be memcpy'd
from rodata. Some of this is variable, however, in things like ext4 and
xfs, depending on, say, mkfs configuration. The situation is even more
complex with network filesystems as this may depend on the server they're
talking to.

But note also that some of this stuff might change file-to-file, even
within a superblock.

(*) The ability to retrieve attributes of a mount point, including information
on the flags, propagation settings and child lists.

(*) The ability to quickly retrieve a list of accessible mount point IDs,
with change event counters to permit userspace (eg. systemd) to quickly
determine if anything changed in the event of an overrun.

(*) The ability to find mounts/superblocks by mount ID. Paths are not unique
identifiers for mountpoints. You can stack multiple mounts on the same
directory, but a path only sees the top one.

(*) The ability to look inside a different mount namespace - one to which you
have a reference fd. This would allow a container manager to look inside
the container it is managing.

(*) The ability to expose filesystem-specific attributes. Network filesystems
can expose lists of servers and server addresses, for instance.

(*) The ability to use the object referenced to determine the namespace
(particularly the network namespace) to look in. The problem with looking
in, say, /proc/net/... is that it looks at current's net namespace -
whether or not the object of interest is in the same one.

(*) The ability to query the context attached to the fd obtained from
fsopen(). Such a context may not have a superblock attached to it yet or
may not be mounted yet.

The aim is to allow a container manager to supervise a mount being made in
a container. It kind of pairs with fsconfig() in that respect.

(*) The ability to query mount and superblock event counters to help a
watching process handle overrun in the notifications queue.
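
To illustrate the overrun-handling point in the list above, here is a rough
userspace sketch of how a watcher might resynchronise from two snapshots of
(mount ID, change counter) pairs. Everything in it is hypothetical: the struct
mount_snap layout and the find_mount() and collect_changed() helpers are
made up for illustration and are not part of the proposed interface. It only
shows why per-mount counters make the "did anything change?" question cheap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot entry: one per accessible mount. */
struct mount_snap {
	uint64_t mnt_id;		/* mount ID */
	uint64_t change_counter;	/* bumped whenever the mount changes */
};

/* Linear search for a mount ID in a snapshot; NULL if it is absent. */
static const struct mount_snap *find_mount(const struct mount_snap *list,
					    size_t n, uint64_t mnt_id)
{
	for (size_t i = 0; i < n; i++)
		if (list[i].mnt_id == mnt_id)
			return &list[i];
	return NULL;
}

/*
 * Compare a snapshot taken before the overrun (old) with one taken after
 * it (cur).  Mount IDs that are new, or whose counter differs, are stored
 * in changed[] (up to max_changed entries).  Returns the total number of
 * such mounts, which may exceed max_changed.  Mounts that vanished are
 * not reported by this sketch.
 */
static size_t collect_changed(const struct mount_snap *old, size_t n_old,
			      const struct mount_snap *cur, size_t n_cur,
			      uint64_t *changed, size_t max_changed)
{
	size_t n = 0;

	assert(cur != NULL || n_cur == 0);
	assert(changed != NULL || max_changed == 0);

	for (size_t i = 0; i < n_cur; i++) {
		const struct mount_snap *prev =
			find_mount(old, n_old, cur[i].mnt_id);

		if (prev && prev->change_counter == cur[i].change_counter)
			continue;	/* unchanged since the old snapshot */
		if (n < max_changed)
			changed[n] = cur[i].mnt_id;
		n++;
	}
	return n;
}
```

A watcher would then only need to re-query the mounts listed in changed[]
rather than rescanning everything.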


What I've done with fsinfo() is:

(*) Provided a number of ways to refer to the object to be queried (path,
dirfd+path, fd, mount ID - with others planned).

(*) Made it so that attributes are referenced by a numeric ID to keep search
time minimal. Numeric IDs must be declared in uapi/linux/fsinfo.h.

(*) Made it so that the core does most of the work. Filesystems are given an
in-kernel buffer to copy into and don't get to see any userspace pointers.

(*) Made it so that values are not, by and large, encoded as text if it can be
avoided. Backward and forward compatibility on binary structs is handled
by the core. The filesystem just fills in the values in the UAPI struct
in the buffer. The core will zero-pad or truncate the data to match what
userspace asked for.

The UAPI struct must be declared in uapi/linux/fsinfo.h.

(*) Made it so that, for some attributes, the core will fill in the data as
best it can from what's available in the superblock, mount struct or mount
namespace. The filesystem can then amend this if it wants to.

(*) Made it so that attributes are typed. The types are few: string, struct,
list of struct, opaque. Structs are extensible: the length is the
version, a new version is required to be a superset of the old version,
and any excess space requested is simply cleared by the kernel.

Information about the type of an attribute can be queried by fsinfo().
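
The zero-pad-or-truncate rule for extensible structs described above can be
sketched in plain userspace C. This is only an illustration under my own
assumptions: the example_limits_v1/v2 structs and the copy_versioned()
helper are hypothetical names, not what is in the patches, and returning the
available size is one possible convention rather than the defined behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute struct, version 1. */
struct example_limits_v1 {
	uint64_t max_file_size;
};

/* Hypothetical version 2: a superset of v1, with fields only appended. */
struct example_limits_v2 {
	uint64_t max_file_size;
	uint64_t max_xattr_size;
};

/*
 * Copy a value of have_size bytes (what the provider filled in) into a
 * caller buffer of want_size bytes (what the caller asked for).
 *
 *  - Caller asked for less: the value is truncated.
 *  - Caller asked for more: the excess is cleared to zero.
 *
 * Returns have_size so the caller can tell which version was available
 * (an assumption made for this sketch).
 */
static size_t copy_versioned(void *user_buf, size_t want_size,
			     const void *value, size_t have_size)
{
	size_t n = want_size < have_size ? want_size : have_size;

	assert(user_buf != NULL && value != NULL);

	memcpy(user_buf, value, n);
	if (want_size > n)
		memset((char *)user_buf + n, 0, want_size - n);
	return have_size;
}
```

With this, a newer caller asking for the v2 struct from an older provider sees
max_xattr_size as zero, and an older caller asking for v1 from a newer
provider just gets the leading fields.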


What I want to avoid:

(*) Adding another magic filesystem.

(*) Adding symlinks from proc to sysfs.

(*) Having to use open to get an attribute.

(*) Having to use multiple opens to get an attribute.

(*) Having to pathwalk to get to the attribute from the object being queried.

(*) Allocating another O_ open flag for this.

(*) Avoidable text encoding and decoding.

(*) Letting the filesystem access the userspace buffer.

Note that I'm not against splitting fsinfo() into a set of sibling syscalls if
that makes it more palatable, or even against using strings for the attribute
IDs, though I'd prefer to avoid the strcmps.

David