Re: [PATCH] kernfs: implement custom llseek method to fix userspace regression

From: Valentin Sinitsyn
Date: Tue Aug 15 2023 - 04:27:01 EST


On 15.08.2023 01:01, Dan Williams wrote:
Valentine Sinitsyn wrote:
Since commit 636b21b50152 ("PCI: Revoke mappings like devmem"),
mmapable sysfs binary attributes have started receiving their
f_mapping from the iomem pseudo filesystem, so that
CONFIG_IO_STRICT_DEVMEM is honored in sysfs (and procfs) as well
as in /dev/[k]mem.

This resulted in a userspace-visible regression: lseek(fd, 0, SEEK_END)
now returns zero regardless of the real sysfs attribute size, which
stat() still reports correctly. The reason is that kernfs files use the
generic_file_llseek() implementation, which relies on the
f_mapping->host inode to get the file size. As f_mapping is now
redefined, f_mapping->host points to an anonymous zero-sized iomem inode
which has nothing to do with the sysfs attribute or the kernfs file
representing it. That said, f_inode remains valid, so stat(), which
uses it, works correctly.

Can you say a bit more about what userspace scenario regressed so that
others doing backports can make a judgement call on the severity?

We encountered this regression in code that used lseek() to determine the size of a PCI region. It was roughly equivalent to:

#define SYSFS_DEVICE_DIR "/sys/bus/pci/devices/<some id>"

int fd = open(SYSFS_DEVICE_DIR "/resource0", O_RDWR);
off_t size = lseek(fd, 0, SEEK_END);
assert(size != 0);

Calling lseek() on this fd with a whence other than SEEK_END and a non-zero offset returns an error, because the kernel treats it as seeking past EOF.

I'll add this explanation to v2 commit message.
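
For anyone who wants to check whether a given kernel is affected, here is a minimal userspace sketch that compares the size obtained via lseek(SEEK_END) with the one fstat() reports for an already opened fd. The helper names (size_via_lseek, size_via_fstat, report_sizes) are made up for this illustration and are not part of the original code or of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size found by seeking to the end; the previous offset is restored. */
static off_t size_via_lseek(int fd)
{
	assert(fd >= 0);
	off_t cur = lseek(fd, 0, SEEK_CUR);
	if (cur == (off_t)-1)
		return (off_t)-1;
	off_t end = lseek(fd, 0, SEEK_END);
	if (end == (off_t)-1)
		return (off_t)-1;
	if (lseek(fd, cur, SEEK_SET) == (off_t)-1)
		return (off_t)-1;
	return end;
}

/* Size as reported by fstat(), i.e. taken from the file's own inode. */
static off_t size_via_fstat(int fd)
{
	struct stat st;

	assert(fd >= 0);
	if (fstat(fd, &st) != 0)
		return (off_t)-1;
	return st.st_size;
}

/*
 * Print both sizes to 'out'.
 * Returns 1 if they match, 0 if they differ, -1 on error.
 */
static int report_sizes(int fd, FILE *out)
{
	off_t by_seek = size_via_lseek(fd);
	off_t by_stat = size_via_fstat(fd);

	if (by_seek == (off_t)-1 || by_stat == (off_t)-1)
		return -1;
	fprintf(out, "lseek: %lld, fstat: %lld\n",
		(long long)by_seek, (long long)by_stat);
	return by_seek == by_stat;
}
```

Going by the description above, calling report_sizes() on an fd opened for a PCI resourceN attribute should return 0 on an affected kernel (lseek gives 0, fstat gives the real size) and 1 once the fix is applied.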



Fix the regression by implementing a custom llseek fop for kernfs,
which uses an attribute's file inode to get the file size,
just as stat() does.

Fixes: 636b21b50152 ("PCI: Revoke mappings like devmem")
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Valentine Sinitsyn <valesini@xxxxxxxxxxxxxx>
---
fs/kernfs/file.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 180906c36f51..6d81e0c981f3 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -903,6 +903,21 @@ static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
return ret;
}
+static loff_t kernfs_fop_llseek(struct file *file, loff_t offset, int whence)
+{
+ /*
+ * This is almost identical to generic_file_llseek() except it uses
+ * cached inode value instead of f_mapping->host.
+ * The reason is that, for PCI resources in sysfs the latter points to
+ * iomem_inode whose size has nothing to do with the attribute's size.
+ */
+ struct inode *inode = file_inode(file);

My only concern is whether there are any scenarios where this is not
appropriate. I.e. do a bit more work to define a kernfs_ops instance
specifically for overriding lseek() in this scenario.

Not sure I'm getting you here: do you mean something like this?

struct inode *inode = is_f_mapping_redefined(file) ? file_inode(file) : file->f_mapping->host;

My understanding is that file->f_inode should always be non-NULL and point to the inode corresponding to the path of the opened file, so it should be safe to use regardless of what f_mapping->host is. Am I missing anything?
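
To make the distinction concrete, here is a userspace toy model of the two lookups. It is not kernel code: the toy_* structures and functions are simplified stand-ins invented for this sketch, and they only mimic the fields discussed in this thread (f_inode, f_mapping->host, i_size).

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures carry far more state. */
struct toy_inode {
	long long i_size;
};

struct toy_address_space {
	struct toy_inode *host;
};

struct toy_file {
	struct toy_inode *f_inode;           /* inode of the opened path */
	struct toy_address_space *f_mapping; /* may be redirected elsewhere */
};

/* EOF as seen through f_mapping->host, as generic_file_llseek() does. */
static long long toy_eof_via_mapping(const struct toy_file *file)
{
	assert(file != NULL && file->f_mapping != NULL);
	assert(file->f_mapping->host != NULL);
	return file->f_mapping->host->i_size;
}

/* EOF as seen through the file's own inode, as stat() does. */
static long long toy_eof_via_inode(const struct toy_file *file)
{
	assert(file != NULL && file->f_inode != NULL);
	return file->f_inode->i_size;
}
```

In this model, a file whose f_mapping still belongs to its own inode gets the same answer from both helpers, while a file whose f_mapping points at a zero-sized inode gets 0 from the first and the attribute size from the second.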

Best,
Valentin


+
+ return generic_file_llseek_size(file, offset, whence,
+ inode->i_sb->s_maxbytes,
+ i_size_read(inode));
+}
+
static void kernfs_notify_workfn(struct work_struct *work)
{
struct kernfs_node *kn;
@@ -1005,7 +1020,7 @@ EXPORT_SYMBOL_GPL(kernfs_notify);
const struct file_operations kernfs_file_fops = {
.read_iter = kernfs_fop_read_iter,
.write_iter = kernfs_fop_write_iter,
- .llseek = generic_file_llseek,
+ .llseek = kernfs_fop_llseek,
.mmap = kernfs_fop_mmap,
.open = kernfs_fop_open,
.release = kernfs_fop_release,
--
2.34.1