Re: [RFC v2 1/2] kernel/sysctl: support setting sysctl parameters from kernel command line

From: Eric W. Biederman
Date: Thu Mar 26 2020 - 08:47:52 EST


Michal Hocko <mhocko@xxxxxxxxxx> writes:

> On Wed 25-03-20 17:20:40, Eric W. Biederman wrote:
>> Vlastimil Babka <vbabka@xxxxxxx> writes:
> [...]
>> > + if (strncmp(param, "sysctl.", sizeof("sysctl.") - 1))
>> > + return 0;
>>
>> Is there any way we can use a slash separated path? I know
>> in practice there are not any sysctl names that don't have
>> a '.' in them but why should we artificially limit ourselves?
>
> Because this is the normal userspace interface? Why should it be any
> different from calling sysctl?
> [...]

Why should the kernel command line implement userspace whims?
I was thinking something like: "sysctl/kernel/max_lock_depth=2048"
doesn't look too bad and it makes things like reusing our
kernel internal helpers much easier.

Plus it suggests that we could do the same for sysfs files:
"sysfs/kernel/fscaps=1"

And the code could be the same for both cases except for the filesystem
prefix.
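
To make the "same code, different prefix" point concrete, here is a
minimal userspace sketch. It assumes the parameter name has already
been separated from its value by the command line parser. The names
struct boot_param and classify_param() are hypothetical and come from
neither the patch nor the kernel; only the "sysctl/" and "sysfs/"
prefixes and the usual /proc/sys and /sys mount points are taken as
given.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type, for illustration only. */
struct boot_param {
	const char *fs_root;	/* "/proc/sys" or "/sys"; NULL if no prefix matched */
	const char *path;	/* rest of the name, e.g. "kernel/max_lock_depth" */
};

/*
 * Map a command line parameter name to a filesystem root and a
 * relative path.  Only the table entry differs between the cases.
 */
static struct boot_param classify_param(const char *param)
{
	static const struct {
		const char *prefix;
		const char *fs_root;
	} table[] = {
		{ "sysctl/", "/proc/sys" },
		{ "sysfs/",  "/sys" },
	};
	struct boot_param out = { NULL, NULL };
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		size_t len = strlen(table[i].prefix);

		if (strncmp(param, table[i].prefix, len) == 0) {
			out.fs_root = table[i].fs_root;
			out.path = param + len;
			break;
		}
	}
	return out;
}
```

With this, classify_param("sysctl/kernel/max_lock_depth") yields the
root "/proc/sys" and the path "kernel/max_lock_depth", and a name
without either prefix yields two NULL pointers so the caller can
ignore it.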

>> Further it will be faster to lookup the sysctls using the code from
>> proc_sysctl.c as it constructs an rbtree of all of the entries in
>> a directory. The code might as well take advantage of that for large
>> directories.
>
> Sounds like a good fit for a follow up patch to me. Let's make this
> as simple as possible for the initial version. But up to Vlastimil of course.

I would argue that reusing proc_sysctl.c:lookup_entry() should make the
code simpler, and easier to reason about.

Especially given the bugs in the first version with a sysctl path.
A clean separation between splitting the path into pieces and
looking up those pieces should make the code more robust.
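
As a rough illustration of that separation, here is a toy userspace
sketch. Everything in it is an assumption made for the example:
struct toy_entry, next_component(), lookup_one() and walk() are
hypothetical names, and the linear scan in lookup_one() only stands
in for whatever per-directory lookup the real code provides (the
rbtree in proc_sysctl.c is not modelled here).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy directory entry; the kernel's structures differ. */
struct toy_entry {
	const char *name;
	const struct toy_entry *children;	/* NULL for a leaf */
	size_t nr_children;
};

/*
 * Step 1: find the next component of a '/'-separated path.
 * Returns its length (0 at end of path) and advances *path past it.
 */
static size_t next_component(const char **path, const char **name)
{
	const char *p = *path;
	size_t len;

	while (*p == '/')
		p++;
	*name = p;
	len = strcspn(p, "/");
	*path = p + len;
	return len;
}

/* Step 2: look up one component in one directory (linear scan). */
static const struct toy_entry *lookup_one(const struct toy_entry *dir,
					  const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < dir->nr_children; i++) {
		const struct toy_entry *e = &dir->children[i];

		if (strlen(e->name) == len && memcmp(e->name, name, len) == 0)
			return e;
	}
	return NULL;
}

/* The walk only glues the two steps together. */
static const struct toy_entry *walk(const struct toy_entry *root,
				    const char *path)
{
	const struct toy_entry *cur = root;
	const char *name;
	size_t len;

	while (cur && (len = next_component(&path, &name)) > 0)
		cur = lookup_one(cur, name, len);
	return cur;
}
```

Because the splitting never looks at the tree and the lookup never
looks at separators, each half can be checked on its own, and the
lookup half could be swapped for a faster one without touching the
parsing.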

That plus I want to get very far away from the incorrect idea that you
can have sysctls without compiling in proc support. That is not how
the code works, that is not how the code is tested.

It is also worth pointing out that:

	proc_mnt = kern_mount(&proc_fs_type);
	for_each_sysctl_cmdline() {
		...
		loff_t pos = 0;
		file = file_open_root(proc_mnt->mnt_root, proc_mnt, sysctl_path, O_WRONLY, 0);
		kernel_write(file, value, value_len, &pos);
	}
	kern_umount(proc_mnt);

Is not an unreasonable implementation.

There are problems with a persistent mount of proc in that it forces
userspace not to use any proc mount options. But a temporary mount of
proc to deal with command line options is not at all unreasonable.
Plus it looks like we can have kernel_write do all of the kernel/user
buffer silliness.
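
For readers more used to userspace, a loose analogue of the
open-relative-to-a-root-then-write pattern can be sketched with
openat(). This is not kernel code and it mounts nothing. The helper
name write_param() and its parameters are hypothetical, and the root
directory is passed in so the sketch can be tried against a scratch
directory instead of /proc/sys.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write value to the file at rel_path, resolved relative to root_dir.
 * Returns 0 on success and -1 on any failure.
 */
static int write_param(const char *root_dir, const char *rel_path,
		       const char *value)
{
	size_t len = strlen(value);
	ssize_t written;
	int dirfd, fd;

	/* Take a handle on the root, much like holding the mount. */
	dirfd = open(root_dir, O_RDONLY);
	if (dirfd < 0)
		return -1;

	/* Resolve the relative path against that root only. */
	fd = openat(dirfd, rel_path, O_WRONLY);
	close(dirfd);
	if (fd < 0)
		return -1;

	written = write(fd, value, len);
	close(fd);
	return (written == (ssize_t)len) ? 0 : -1;
}
```

A call such as write_param("/proc/sys", "kernel/max_lock_depth",
"2048") would need root privileges. Pointing root_dir at a temporary
directory containing a pre-created file exercises the same path
without touching any real sysctl.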

> [...]
>
>> Hmm. There is a big gotcha in here and I think it should be mentioned.
>> This code only works because no one has done set_fs(KERNEL_DS). Which
>> means this only works with strings that are kernel addresses essentially
>> by mistake. A big fat comment documenting why it is safe to pass in
>> kernel addresses to a function that takes a "char __user*" pointer
>> would be very good.
>
> Agreed

Eric