Re: [PATCH] kernfs: use stack-buf for small writes.

From: NeilBrown
Date: Wed Sep 24 2014 - 19:59:15 EST


On Wed, 24 Sep 2014 20:17:27 +0100 Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote:
> >
> > For a write <= 128 characters, don't use kmalloc.
> >
> > mdmon, part of mdadm, will sometimes need to write
> > to a sysfs file in order to allow writes to the array
> > to continue. This is important to support RAID metadata
> > types that the kernel doesn't know about.
> >
> > It is important that this write doesn't block on
> > memory allocation. The safest way to ensure that is to
> > use an on-stack buffer.
>
> Is it, now? What happens if caller of write(2) loses CPU and gets
> the page it's writing from swapped out, just as it's getting to
> issuing the actual syscall? copy_from_user() will fault and cause
> memory allocation for you...

That problem is easily solved with

mlockall(MCL_CURRENT|MCL_FUTURE);

That will prevent any swapout.

I see this effort a bit like an extension to MCL_FUTURE.

A process that wants to avoid latency from the memory system can use
mlockall() and then pre-allocate any memory that it might need.
Then it can be sure it will not block on memory accesses.

However it is not possible to pre-allocate memory that will be used
by the kernel, such as in a write to a sysfs file.

So my real goal is to ensure that when MCL_FUTURE has been set, it is
possible to pre-allocate any kernel memory that might be needed.
For this case it was easiest to simply "not need" any kernel memory.

I have a patch which ensures that O_DIRECT writes to a block device will not
block on memory allocations. It keeps the 'struct dio' around precisely if
MCL_FUTURE was used. So after the first access (read or write) future writes
will not block in memory allocation.

An alternate approach might be to allow PF_MEMALLOC to be permanently set
from user-space, but I suspect that is too heavy handed and wouldn't be a
good solution.


A key point here is that it is the "process" rather than the "file" that is
special. It isn't that accesses to the file should never block. It is that
the process should be able to avoid blocking. That is why I don't really
want to special-case a few sysfs attribute files.

>
> > Writes are always small, typically less than 10 characters.
>
> Translation: "but my userland code uses it _that_ way; surely, nobody
> would do anything else..."

That isn't how I see it.
These are sysfs attribute files. By definition they contain precisely one
value - typically a bool, enum, or integer.
I realise that this "definition" is often stretched and sometimes completely
ignored, but it is largely adhered to and when it is violated it is more
often the "read" side than the "write" side.

So it isn't just "my userland code" - it is any sane code on any sane sysfs
attribute. "typically less than 10" is probably not helpful, but the code
was for "less than 128 is all interesting cases" and that covers all numbers,
enums, bools and anything that isn't a gross violation of "one value".

But if you don't like a limit (and I do agree that it is usually good to
avoid arbitrary limits even though we already have an arbitrary limit of
PAGE_SIZE) we could do it like this.
This allows the userland code to pre-allocate a buffer by performing a write.
I haven't made it conditional on
current->mm->def_flags & VM_LOCKED
so it would only affect mlock(MCL_FUTURE) processes, but that would be easy.

Thanks for your comments,
NeilBrown


diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4429d6d9217f..a4d28f78cae1 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -278,40 +278,49 @@ static ssize_t kernfs_fop_write(struct file *file, const char __user *user_buf,
len = min_t(size_t, count, PAGE_SIZE);
}

- buf = kmalloc(len + 1, GFP_KERNEL);
- if (!buf)
- return -ENOMEM;
-
- if (copy_from_user(buf, user_buf, len)) {
- len = -EFAULT;
- goto out_free;
- }
- buf[len] = '\0'; /* guarantee string termination */
-
/*
* @of->mutex nests outside active ref and is just to ensure that
* the ops aren't called concurrently for the same open file.
+ * Also protects of->buf.
*/
mutex_lock(&of->mutex);
if (!kernfs_get_active(of->kn)) {
mutex_unlock(&of->mutex);
- len = -ENODEV;
- goto out_free;
+ return -ENODEV;
}

+ if (of->buf && of->buflen >= len+1)
+ buf = of->buf;
+ else
+ buf = kmalloc(len + 1, GFP_KERNEL);
+ if (!buf) {
+ len = -ENOMEM;
+ goto out_unlock;
+ }
+ if (buf != of->buf) {
+ kfree(of->buf);
+ of->buf = buf;
+ of->buflen = len + 1;
+ }
+
+ if (copy_from_user(buf, user_buf, len)) {
+ len = -EFAULT;
+ goto out_unlock;
+ }
+ buf[len] = '\0'; /* guarantee string termination */
+
ops = kernfs_ops(of->kn);
if (ops->write)
len = ops->write(of, buf, len, *ppos);
else
len = -EINVAL;

- kernfs_put_active(of->kn);
- mutex_unlock(&of->mutex);
-
if (len > 0)
*ppos += len;
-out_free:
- kfree(buf);
+
+out_unlock:
+ kernfs_put_active(of->kn);
+ mutex_unlock(&of->mutex);
return len;
}

@@ -728,6 +737,7 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)

kernfs_put_open_node(kn, of);
seq_release(inode, filp);
+ kfree(of->buf);
kfree(of);

return 0;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf797c2c3..c6addf635990 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -179,6 +179,8 @@ struct kernfs_open_file {
struct mutex mutex;
int event;
struct list_head list;
+ char *buf;
+ int buflen;

size_t atomic_write_len;
bool mmapped;

Attachment: signature.asc
Description: PGP signature