Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD when pti_disable is set

From: Willy Tarreau
Date: Thu Jan 11 2018 - 13:06:01 EST


On Thu, Jan 11, 2018 at 09:53:26AM -0800, Andy Lutomirski wrote:
> >> So I think that no-pti mode is a privilege as opposed to a mode per
> >> se. If you can turn off PTI, then you have the ability to read all of
> >> kernel memory. So maybe we should treat it as such. Add a capability
> >> CAP_DISABLE_PTI. If you have that capability (globally), then you can
> >> use the arch_prctl() or regular prctl() or whatever to turn PTI off.
> >> If you lose the cap, you lose no-pti mode as well.
> >
> > I disagree on this, because the only alternative I have is to decide
> > to keep my process running as root, which is even worse, as root can
> > much more easily escape from a chroot jail for example, or access
> > /dev/mem and read all the memory as well. Also tell Linus that he'll
> > have to build his kernels as root ;-)
>
> Not since Linux 4.3 :) You can set CAP_DISABLE_PTI as an "ambient"
> capability and let it be inherited by all descendants even if
> unprivileged. This was all very carefully designed so that a program
> that inherited an ambient capability but tries to run a
> less-privileged child using pre-4.3 techniques will correctly drop the
> ambient capability, which is *exactly* what we want for PTI.

Ah thanks for explaining what these "ambient" capabilities are, I saw
the term a few times but never looked closer.
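
For reference, a minimal sketch of how an ambient capability is queried
and raised from userspace with prctl(PR_CAP_AMBIENT, ...). It assumes a
stand-in capability (CAP_NET_BIND_SERVICE), since CAP_DISABLE_PTI is only
a proposal in this thread and has no number; the helper names are
hypothetical. Raising only succeeds if the capability is already in both
the permitted and inheritable sets.

```c
#include <sys/prctl.h>
#include <linux/capability.h>

/*
 * Stand-in only: CAP_DISABLE_PTI is a proposal and does not exist, so an
 * existing capability is used to illustrate the mechanism.
 */
#define DEMO_CAP CAP_NET_BIND_SERVICE

/* Returns 1 if cap is in the ambient set, 0 if not, -1 on error. */
int ambient_is_set(unsigned long cap)
{
	return prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_IS_SET, cap, 0, 0);
}

/*
 * Tries to add cap to the ambient set so that it survives execve() of
 * ordinary (non-setuid, no file capabilities) programs.
 * Returns 0 on success, -1 with errno set (typically EPERM) on failure.
 */
int ambient_raise(unsigned long cap)
{
	return prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0);
}
```

A "nopti"-style wrapper would presumably call something like
ambient_raise() and then execvp() the requested command; an unprivileged
caller simply gets -1 back.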

> So I stand by my suggestion. Linus could still do:
>
> $ nopti make -j512
>
> and have it work, but trying to ptrace() the make process from outside
> the nopti process tree (and without doing nopti ptrace) will fail, as
> expected. (Unless root does the ptrace, again as expected.)

This may be reasonable.

> > The arch_prctl() calls I proposed only allow privileged users to turn
> > PTI off, but any user can turn it back on. For me that's
> > important. A process might trust itself for its own use, but such
> > processes will rarely trust external processes in case they need to
> > perform an occasional fork+exec. Imagine for example a static web
> > server that needs to achieve the highest possible performance and
> > occasionally has to call logrotate to rotate+compress the logs.
> > It's better if the process knows how to turn PTI back on before
> > calling this.
>
> In my proposal, CAP_DISABLE_PTI doesn't turn off PTI -- it just grants
> the ability to have PTI off. If you have PTI off, you can turn it
> back in using prctl() or whatever. So you call prctl() (to turn PTI
> back on) or capset() (to turn it on and drop the ability to turn it
> off).

Hmmm OK. I still don't much like conflating the "turn back on" action
between the two distinct calls. If the capability grants you the right to
act on prctl(), dropping it should not perform the action behind your back.
It may even lead people to care less about it, which is not good practice.

> How exactly do you plan to efficiently call logrotate from your
> webserver if the flag is per-mm, though? You don't want to fork() in
> your fancy server because fork() write-protects the whole address
> space.

I'm not following you, what's the problem here? I mean most programs
do fork(), then close() a few FDs that were not marked CLOEXEC, then
execve(). The purpose is to avoid changing existing programs too much.
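
To make the pattern concrete, here is a minimal sketch of the classic
fork() + close() + execve() sequence. The function name and the FD range
arguments are hypothetical, and the call that would turn PTI back on is
left as a comment because the interface is still only a proposal.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Forks, closes inherited descriptors in [first_fd, last_fd] in the child,
 * then executes path. Returns the child's exit status, or -1 on failure
 * or if the child did not exit normally.
 */
int fork_exec_wait(const char *path, char *const argv[],
		   int first_fd, int last_fd)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/*
		 * Child: it has its own copy-on-write mm here, so this is
		 * where PTI would be turned back on through the proposed
		 * interface (omitted: not a merged API).
		 */
		for (int fd = first_fd; fd <= last_fd; fd++)
			close(fd);
		execve(path, argv, environ);
		_exit(127);	/* only reached if execve() failed */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```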

> So you use clone() or vfork() (like posix_spawn() does
> internally on a sane libc), but now you can't turn PTI back on if it's
> per-mm because you haven't changed your mm.

But as soon as you write anything it's cloned, right? I.e. you write to
the stack. I could be saying bullshit here, but surely we have a way to
split them.
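
For comparison, a minimal sketch of the posix_spawn() variant mentioned
in the quoted text. The function name is hypothetical. Where libc
implements posix_spawn() with vfork() or clone(CLONE_VM), the child runs
in the parent's mm until execve() with no copy-on-write, which is why a
per-mm flag could not be changed for the child alone at that point.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Spawns path with argv and waits for it. Returns the child's exit
 * status, or -1 on failure or if the child did not exit normally.
 */
int spawn_wait(const char *path, char *const argv[])
{
	pid_t pid;
	int status;

	/* No file actions and no attributes: plain spawn of the program. */
	if (posix_spawn(&pid, path, NULL, NULL, argv, environ) != 0)
		return -1;

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```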

> I really really think it should be per thread.

It could be a good argument here.

> >> As for per-mm vs per-thread, let's make it only switchable in
> >> single-threaded processes for now and inherited when threads are
> >> created.
> >
> > That's exactly what it does for now, but Linus doesn't like it at all.
> > So I'll switch it back to per-mm + per-CPU variable. Well he has a valid
> > point regarding the pgd and _PAGE_NX setting. Now at least we know
> > the change is minimal if we have a good reason for doing differently
> > later.
>
> Yuck, I hate this. Are you really going to wire it up complete with
> all the IPIs needed to get the thing synced up right? It's also going
> to run slower no matter what, I think, because you'll have to sync the
> per-mm state back to the TI flags on context switches.

At this point I'm lost, all I can do is trust you guys once you agree
on a solution :-)

> Linus, can you explain again why you think PTI should be a per-mm
> thing? That is, why do you think it's useful and why do you think it
> makes logical sense from a user's POV? I think the implementation is
> easier and saner for per-thread. Also, if we use a capability bit
> for it, making it per-mm gets really weird.

thanks,
Willy