Re: RFC: fsyscall

From: Andy Lutomirski
Date: Tue Sep 08 2015 - 19:19:06 EST


On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>
>> On Tue, Sep 8, 2015 at 3:35 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>>
>>> I was thinking a bit about the problem of allowing another process to
>>> perform a subset of what your process can perform, and it occurred to me
>>> there might be something conceptually simple we can do.
>>>
>>> Have a system call fsyscall that takes a file descriptor, the system call
>>> number, and the parameters to that system call as arguments. AKA
>>> long fsyscall(int fd, long number, ...); AKA syscall with a file
>>> descriptor argument.
>>>
>>> The fd would hold a struct cred, and a filter that limits what system
>>> calls and which parameters may be passed.
>>>
>>> The implementation of fsyscall would be something like:
>>> old = override_creds(f->f_cred);
>>> /* Perform filtered syscall */
>>> revert_creds(old);
>>>
>>> Then we have another system call, call it fsyscall_create(...), that takes
>>> a bpf filter and returns a file descriptor that can be used with
>>> fsyscall.
>>>
>>> I'm not certain that bpf is the best way to create such a filter but it
>>> seems plausible, and we already have the infrastructure in place, so if
>>> nothing else there would be synergy in syscall filtering.
>>>
>>> My two concerns with bpf are (a) it seems a little complex for the
>>> simplest use cases, and (b) I think there are cases like inspecting the
>>> data passed into write, or send, or the structure passed into ioctl that
>>> it doesn't handle well yet.
>>>
>>> Andy, does an fsyscall system call sound like something that would not
>>> be too bad to implement? (You have just been through all of the x86
>>> system call paths recently).
>>
>> It's not possible yet due to nasty calling convention issues.
>> (Entries in the x86 syscall table aren't actually functions callable
>> using the C ABI right now.) My pending monster patchset will make it
>> possible to implement for 32-bit syscalls (native and compat). I'm
>> planning on addressing 64-bit, and I want to do almost the reverse of
>> what you're proposing: have a way that one task can trap into a
>> special mode in which another process can do syscalls on its behalf.
>
> Hmm. That seems comparatively dangerous to me.
>
>> There are some syscalls for which this simply makes no sense.
>> Setresuid, capset, and similar come to mind. Clone and friends may
>> screw up impressively if you try this. fsyscall should not be allowed
>> to call itself. If you call write(2) like this and it has any
>> meaningful effect, something's wrong.
>
> If you peek into the data that is being written it can be meaningful on
> write(2).
>
> Hmm. But yes for file descriptor based system calls this is much less
> interesting. Having some kind of wrapper that embeds one file
> descriptor in another and does the filtering that way seems more
> interesting, for the file descriptor based methods.
>
>> keyctl(2) does really awful
>> things wrt struct cred, and I don't really want to think about what
>> happens if you try calling it like this.
>>
>> override_creds is IMO awful. Serge and I had an old discussion on how
>> to maybe fix it.
>>
>> Honestly, I think the way to go might be to get Capsicum, or at least
>> Capsicum's fd model, merged and to add a mode in which the *at
>> operations on a specially marked fd use the passed fd's f_cred instead
>> of the caller's. (Cc: David Drysdale -- that feature might be really
>> nice.)
>
> Perhaps I had missed it but I don't recall capsicum being able to wrap
> things like reboot(2).
>

Ah, so you want to be able to grant BPF-defined capabilities :)

Off the top of my head, I think that doing this using a nice IPC
mechanism (which barely exists in Linux, but which seL4 and binder (!)
can do very cleanly) would be simpler and more general, if less
self-contained.

(Aside: how on earth does anyone think that replacing binder with
kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
OTOH, maybe Android doesn't use the capability-passing ability.)

> Which really describes what I am trying to tackle. How do we create an
> object that we can pass between processes that limits what we can do in
> the case of the oddball syscalls that require special privileges?
>
> At the same time I still want the caller to be able to pass in data to
> the system calls being called such as REBOOT_CMD_POWER_OFF versus
> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
> REBOOT_CMD_CAD_OFF.
>

We could have a conservative whitelist of syscalls for which we allow
this usage. I'm a bit worried that there will be very limited use
cases, given that a lot of use cases will want to follow pointers,
which has TOCTOU problems.
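
As a rough userspace sketch of what such a whitelist check could look
like (not kernel code, and not a proposed interface): struct fsys_rule,
fsys_rules and fsys_allowed() are hypothetical names made up for
illustration. The constants are the real LINUX_REBOOT_CMD_* values from
<linux/reboot.h>, which is what the REBOOT_CMD_* names quoted above
refer to. The check is default-deny and only compares scalar argument
values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <linux/reboot.h>   /* LINUX_REBOOT_CMD_*, LINUX_REBOOT_MAGIC* */
#include <sys/syscall.h>    /* SYS_reboot, SYS_write */

/*
 * Hypothetical rule: one allowed syscall plus the values accepted for
 * one scalar argument.  Nothing here is an existing kernel interface.
 */
struct fsys_rule {
	long nr;                      /* syscall number */
	unsigned int arg;             /* index of the argument to check (0-5) */
	const unsigned long *values;  /* accepted values for that argument */
	size_t nvalues;
};

/* reboot(magic1, magic2, cmd, arg): cmd is argument index 2. */
static const unsigned long reboot_cmds[] = {
	LINUX_REBOOT_CMD_POWER_OFF,
	LINUX_REBOOT_CMD_HALT,
};

static const struct fsys_rule fsys_rules[] = {
	{ SYS_reboot, 2, reboot_cmds,
	  sizeof(reboot_cmds) / sizeof(reboot_cmds[0]) },
};

/*
 * Default deny: a syscall passes only if it has a rule and the checked
 * argument matches one of the listed values.  Only argument values are
 * compared; no pointer argument is ever dereferenced.
 */
static bool fsys_allowed(long nr, const unsigned long args[6])
{
	for (size_t i = 0; i < sizeof(fsys_rules) / sizeof(fsys_rules[0]); i++) {
		const struct fsys_rule *r = &fsys_rules[i];

		if (r->nr != nr)
			continue;
		for (size_t j = 0; j < r->nvalues; j++)
			if (args[r->arg] == r->values[j])
				return true;
		return false;
	}
	return false;
}
```

With this table, reboot with LINUX_REBOOT_CMD_POWER_OFF or
LINUX_REBOOT_CMD_HALT passes, while LINUX_REBOOT_CMD_CAD_OFF and every
syscall without a rule are refused. Sticking to scalar arguments is
deliberate: a filter that read through a pointer argument would be
checking memory that the caller can change before the syscall itself
reads it.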

--Andy