Re: [PATCH] context_tracking: Restore previous state in schedule_user

From: Frederic Weisbecker
Date: Wed Dec 03 2014 - 20:13:24 EST


On Wed, Dec 03, 2014 at 04:38:46PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 3, 2014 at 4:30 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
> > On Wed, Dec 03, 2014 at 04:04:31PM -0800, Andy Lutomirski wrote:
> > > On Wed, Dec 3, 2014 at 3:58 PM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > > > On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
> > > >> It appears that some SCHEDULE_USER (asm for schedule_user) callers
> > > >> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
> > > >> and schedule_user will return in RCU user context. This causes RCU
> > > >> warnings and possible failures.
> > > >>
> > > >> This is intended to be a minimal fix suitable for 3.18.
> > > >>
> > > >> Reported-by: Dave Jones <davej@xxxxxxxxxx>
> > > >> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> > > >> Cc: Frédéric Weisbecker <fweisbec@xxxxxxxxx>
> > > >> Cc: Paul McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > > >> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> > > >
> > > > Ah, we sent it about at the same time :-)
> > > >
> > > > Might be too late for 3.18 though because it's not a regression.
> >
> > Wait, so how come that trace didn't start showing up until recently ?
>
> Looking at the code, it's because int_careful has the same bug, but
> syscall_trace_leave does:
>
> /*
> * We may come here right after calling schedule_user()
> * or do_notify_resume(), in which case we can be in RCU
> * user mode.
> */
> user_exit();
>
> which means that this issue was anticipated when that comment was written.

Indeed, in fact it was expected to work as long as the code that follows the
syscall is limited to schedule_user(), syscall_trace_leave() and do_notify_resume().
But if anything else is called and uses RCU, this doesn't work anymore.

So user_enter() and user_exit() have been designed to be re-entrant on purpose.

>
> Prior to the 3.18 seccomp changes and the _TIF_WORK typo fix, it would
> have been difficult to hit sysret_audit when context tracking was on
> (you could do it once on the way out from a syscall that enabled
> context tracking). So this is 3.18 regression.

I see now.

So the real problem is not on schedule_user(). It's rather that __audit_syscall_exit()
should we wrapped inside user_exit()/user_enter() or exception_foo(). The latter
is safer in a sensitive patch. That would be the real and simple regression fix.
Tweaking schedule_user() is more risky.

Then, if you like, we can rethink the whole later, define syscall_trace_leave()
as the only place that calls user_enter() and all the other syscall exit functions
(schedule_user(), do_notify_resume(), __audit_syscall_exit()) can just call
exception_enter() - exception_exit() if they can be called after syscall_trace_leave().
Then finally we can make user_enter and user_exit non-reentrant after careful audit
of how other archs use it (sounds scary though).

Or better yet: if you rework the syscall exit slow path, lets call user_enter() at the
very end of the syscall.

>
> The sysret_audit code is still totally screwed up AFAICT. At the very
> least, the whole mess rather strongly suggests that, if both context
> tracking and audit are on, then __audit_syscall_exit is called *twice*
> on each syscall. __audit_syscall_exit seems to be idempotent, so
> maybe no one has noticed that little glitch.
>
> I'll ask the x86 people to include my sysret_audit removal for 3.19,
> since I think that this schedule_user change is a better last-minute
> fix than removing a whole chunk of asm.
>
> --Andy
>
> >
> > Dave
> >
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/