Re: RFC: fsyscall

From: Eric W. Biederman
Date: Tue Sep 08 2015 - 20:32:48 EST


Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:

> On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:

>> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> things like reboot(2).
>>
>
> Ah, so you want to be able to grant BPF-defined capabilities :)

Pretty much.

Where I am focusing is turning POSIX capabilities into real
capabilities. I would not mind if the functionality was a bit more
general, say, able to handle things like security labels, or
anywhere else you might reasonably be asked "can you do X?"

But I would be happy if we just managed to wrap the POSIX capabilities
and turned them into real capabilities.

> Off the top of my head, I think that doing this using a nice IPC
> mechanism (which barely exists in Linux, but which seL4 and binder (!)
> can do very cleanly) would be simpler and more general, if less
> self-contained.

Less self-contained becomes a problem when you want to pass them between
processes written at different times by different people. If there
is something conceptually simple we can implement in the kernel it
becomes worth it because that becomes the standard which everyone knows
to code to.

> (Aside: how on earth does anyone think that replacing binder with
> kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
> OTOH, maybe Android doesn't use the capability-passing ability.)

kdbus has file descriptor passing. Beyond that no comment.
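
For reference, file descriptor passing is the one way we already have to
hand a kernel object to another process. Below is a minimal userspace
sketch over an AF_UNIX socket using SCM_RIGHTS. The helper names send_fd()
and recv_fd() are made up for illustration, and error handling is reduced
to returning -1.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctrl ctrl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor for the same open file, which
is the property a "real capability" for something like reboot(2) would need
as well.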

>> Which really describes what I am trying to tackle. How do we create an
>> object that we can pass between processes that limits what we can do in
>> the case of the oddball syscalls that require special privileges.
>>
>> At the same time I still want the caller to be able to pass in data to
>> the system calls being called such as REBOOT_CMD_POWER_OFF versus
>> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
>> REBOOT_CMD_CAD_OFF.
>>
>
> We could have a conservative whitelist of syscalls for which we allow
> this usage. I'm a bit worried that there will be very limited use
> cases, given that a lot of use cases will want to follow pointers,
> which has TOCTOU problems.

Time of check to time of use problems. Interesting point.

TOCTOU seems to make filtering of system calls in general much less
viable than I had hoped or imagined, and seems to be one of the better
arguments I have heard against ioctls.

I think the cases I care about are much less likely to have TOCTOU
problems than system calls in general, so I still may be ok.

However, it does seem that past a certain point, good filtering needs
the entire syscall ABI to be turned into well-defined IPC. Ick!

Sigh. I guess it is about time I dig up the places we call capable().
Ugh, 1696 places in the kernel. Even filtering out CAP_SYS_ADMIN and
CAP_NET_ADMIN, the list is longer than I can easily look at.

Still reboot isn't a problem ;)
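
To make the reboot case concrete: the command is a plain integer argument,
so a filter can decide on the value alone without following any pointer.
A rough userspace sketch of such a check follows. The function name
reboot_cmd_allowed() and the policy itself are made up for illustration;
the constants are the LINUX_REBOOT_CMD_* values from <linux/reboot.h>,
which is what the shorter REBOOT_CMD_* names above refer to.

```c
#include <linux/reboot.h>
#include <stdbool.h>

/*
 * Hypothetical policy: the holder may halt or power off the machine,
 * and nothing else (no restart, no toggling of Ctrl-Alt-Del).
 */
static bool reboot_cmd_allowed(unsigned int cmd)
{
	switch (cmd) {
	case LINUX_REBOOT_CMD_HALT:
	case LINUX_REBOOT_CMD_POWER_OFF:
		return true;
	default:
		/* Deny by default so that new commands are not allowed by accident. */
		return false;
	}
}
```

Because the argument is passed by value, the number the filter inspects is
the number the system call acts on.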

Thinking about the TOCTOU problems with system call filtering, the only
general solution I can see is to handle it like the compat syscalls,
but instead of copying things into a temporary buffer in userspace
we copy the data into a temporary in-kernel buffer, filter the system
call, and then:

	fs = get_fs();		/* save the current address limit */
	set_fs(get_ds());	/* let uaccess helpers accept kernel pointers */
	/* Call the system call */
	set_fs(fs);		/* restore the address limit */

I don't like the whole set_fs() thing (especially if there is any data
we did not manage to copy). But it seems like a good conceptual start.
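
The copy-then-filter idea can be modelled in plain userspace C. This is
only a sketch under stated assumptions, not kernel code: struct demo_args,
demo_filter(), demo_syscall() and the race callback are all hypothetical
names, and memcpy() stands in for copy_from_user(). The callback runs
between the check and the use to play the role of another thread rewriting
the caller's memory.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical argument block a syscall would normally reach via a user pointer. */
struct demo_args {
	unsigned int cmd;
};

/* Hypothetical filter: only cmd 1 is permitted. */
static bool demo_filter(const struct demo_args *a)
{
	return a->cmd == 1;
}

/* Stand-in for the real system call body; returns the cmd it acted on. */
static long demo_syscall(const struct demo_args *a)
{
	return (long)a->cmd;
}

/* Models a concurrent writer that modifies the caller's memory. */
typedef void (*race_fn)(struct demo_args *shared);

/* Racy variant: the filter and the syscall both read the shared memory. */
static long call_racy(struct demo_args *shared, race_fn race)
{
	if (!demo_filter(shared))
		return -1;
	if (race)
		race(shared);		/* memory changes after the check */
	return demo_syscall(shared);	/* acts on the changed value */
}

/* Snapshot variant: copy once, then filter and call on the private copy. */
static long call_snapshot(struct demo_args *shared, race_fn race)
{
	struct demo_args copy;

	memcpy(&copy, shared, sizeof(copy));	/* models copy_from_user() */
	if (!demo_filter(&copy))
		return -1;
	if (race)
		race(shared);		/* too late, the copy is private */
	return demo_syscall(&copy);
}

/* Example writer: swaps the permitted cmd for a forbidden one. */
static void flip_cmd(struct demo_args *shared)
{
	shared->cmd = 2;
}
```

With flip_cmd() as the writer, call_racy() ends up acting on the forbidden
cmd 2 even though the filter saw cmd 1, while call_snapshot() acts on the
value that was actually checked. Anything the syscall reads that was not
part of the copy is still exposed, which is the caveat above.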

Eric