Re: [Regression full nohz] [PATCH] x86: Don't call context tracking APIs on IRQs

From: Frederic Weisbecker
Date: Fri Oct 16 2015 - 11:05:12 EST


On Tue, Oct 13, 2015 at 09:31:14AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 13, 2015 at 9:01 AM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > From: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> > Date: Sat, 3 Oct 2015 01:18:09 +0200
> > Subject: [PATCH] x86: Don't call context tracking APIs on IRQs
> >
> > IRQs already call irq_enter() and irq_exit() which take care of RCU
> > and vtime needs. There is no need to call user_enter() / user_exit()
> > on IRQs except on IRQ exit time if we schedule out or handle signals.
> >
> > This may result in a performance regression when context tracking is
> > enabled, not to mention that enter_from_user_mode() is called all the
> > time on IRQ entry when CONFIG_CONTEXT_TRACKING=y (which is enabled in
> > many distros) even though context tracking is actually not running,
> > defeating the static key optimizations.
>
> I would be rather surprised if that makes a significant difference.

You should have tested it and shown me that it doesn't make a difference
before getting this merged, given that I was strictly opposed to it.

But it's rather a question of design. It's ugly to call the RCU and vtime APIs
twice when it's not needed.
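
To make the duplication concrete, here is roughly what happens on IRQ entry
from user space with the merged code (call chains abbreviated from the 4.3-era
sources; the exact hooks depend on the config):

	enter_from_user_mode()
		-> user_exit()		/* tells RCU and vtime: now in kernel */
	irq_enter()
		-> rcu_irq_enter()	/* tells RCU again */
		-> irqtime/vtime hooks	/* tells vtime again */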

> It's a call to a function that should be very short right after an
> exceedingly heavy serializing operation (namely interrupt delivery).
> Is this easy to benchmark?

perf bench sched messaging, perhaps, although full dynticks isn't optimized for
IO-bound and multithreaded workloads. Still, it can give an idea of the issue.

But does it really matter? Again, calling the RCU/vtime hooks twice
unconditionally is ugly.

>
> A nicer approach IMO would be to fix the static key optimizations.

For cases where full dynticks isn't running, yes (due to the unconditional
function call). For cases where full dynticks is running, it's rather a matter
of not doubling the context tracking calls.

Probably the most efficient way would be to move the rcu/vtime calls out of the
core IRQ code down to the arch entry code.

Or the other way around: move the IRQ exit code (rescheduling, signals) from
the archs to core code and optimize the rcu/vtime code there.
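
A minimal sketch of that second option, assuming a made-up common_irq_return()
helper living in core code (illustration only, not a proposed interface):

	/*
	 * Core-code IRQ return path: the rescheduling/signal work and
	 * the rcu/vtime bookkeeping sit in one place, so they can be
	 * optimized together instead of being duplicated per arch.
	 */
	void common_irq_return(struct pt_regs *regs)
	{
		irq_exit();	/* rcu/vtime accounting */
		if (user_mode(regs))
			prepare_exit_to_usermode(regs);	/* resched, signals */
	}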

But please not this halfway state that makes things worse.

> >
> > This could be optimized by pulling irq_enter/exit into the low level irq
> > code, but that requires more thought.
> >
>
> Given the choice, I'd much rather do it that way.

Ok, but that's a longer-term view. Don't make things worse in the short term.

>
> FWIW, we're extremely close to being able to say (and verify at
> runtime!) that any kernel code with IRQs on is *always* tracked as
> kernel mode. Unless I've missed something, the single remaining
> exception is the 64-bit native syscall entry path. Once that's done,
> we can drop exception_enter entirely and we can drop the
> irq_save/irq_restore from the context tracking code itself. (Rik
> thinks that the save/restore is a fairly large fraction of the total
> overhead in the context-tracking-enabled case.)

I'm still skeptical about the exception part. Instead of making sure that
we never trigger an exception on kernel entry (through multiple hacks
like forbidding traps/breakpoints in entry code), we should keep the
internal context tracking state. Anything based on regs will always
be more fragile.

The plan I have in mind, with the lowest overhead and highest reliability, is:

	exception entry {
		allocate on stack sizeof(regs) + sizeof(ctx_state)

		asm static key call x86_exception_enter() {
			ctx_state = exception_enter();
		}

		do exception

		asm static key call x86_exception_exit() {
			exception_exit(ctx_state);
		}
	}
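
A rough C-side sketch of what those two asm-called helpers could look like
(x86_exception_enter()/x86_exception_exit() are just the hypothetical names
from the pseudocode above; the asm stubs behind the static key would stash
ctx_state in the frame they allocated):

	#include <linux/context_tracking.h>

	/*
	 * Only reached when the context tracking static key is on,
	 * so there is no need to re-check that tracking is enabled.
	 */
	__visible enum ctx_state x86_exception_enter(void)
	{
		return exception_enter();
	}

	__visible void x86_exception_exit(enum ctx_state prev_state)
	{
		exception_exit(prev_state);
	}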


> This patch is a step back from that.

It's a short-term fix until we find a long-term solution.

>
> > Signed-off-by: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> > ---
> > arch/x86/entry/common.c | 11 ++++++++++-
> > arch/x86/entry/entry_64.S | 11 ++++++-----
> > 2 files changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > index 80dcc92..1b7a866 100644
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -229,6 +229,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
> > * work to clear some of the flags can sleep.
> > */
> > while (true) {
> > + enum ctx_state prev_state;
> > u32 cached_flags =
> > READ_ONCE(pt_regs_to_thread_info(regs)->flags);
> >
> > @@ -237,8 +238,10 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
> > _TIF_USER_RETURN_NOTIFY)))
> > break;
> >
> > +
> > /* We have work to do. */
> > local_irq_enable();
> > + prev_state = exception_enter();
>
> This loop is IMO fairly gross. We enter prepare_exit_to_usermode from
> a context that is either tracked as kernel mode or that was tracked as
> kernel mode and then switched back early (due to irq_exit or
> whatever), and now we run a loop and switch back and forth? I had
> hoped that this switching back and forth in a loop would stay dead
> after I killed it :)

Ideally we should do:

	if (cached_flags) {
		exception_enter()
		while (true) {
			do the exit loop
		}
		exception_exit()
	}

to avoid the repetition of the context tracking calls.
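
A hedged sketch of that shape against the 4.3-era prepare_exit_to_usermode()
(EXIT_WORK_FLAGS and handle_exit_work() are made-up stand-ins for the _TIF_*
mask and the body of the existing loop):

	u32 cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);

	if (cached_flags & EXIT_WORK_FLAGS) {
		enum ctx_state prev_state = exception_enter();

		do {
			local_irq_enable();
			handle_exit_work(regs, cached_flags); /* resched, signals, ... */
			local_irq_disable();
			cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
		} while (cached_flags & EXIT_WORK_FLAGS);

		exception_exit(prev_state);
	}

That way the context tracking transition happens at most once per return to
user space, no matter how many times the work loop iterates.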

>
> >
> > if (cached_flags & _TIF_NEED_RESCHED)
> > schedule();
> > @@ -258,10 +261,16 @@ __visible void prepare_exit_to_usermode(struct pt_regs *regs)
> > if (cached_flags & _TIF_USER_RETURN_NOTIFY)
> > fire_user_return_notifiers();
> >
> > + exception_exit(prev_state);
> > +
> > /* Disable IRQs and retry */
> > local_irq_disable();
> > }
> > +}
> >
> > +__visible void prepare_exit_to_usermode_track(struct pt_regs *regs)
> > +{
> > + prepare_exit_to_usermode(regs);
> > user_enter();
> > }
> >
>
> This change here seems likely to break things. All of the new code
> assumes that it's okay to call prepare_exit_to_usermode and then
> immediately go back to user mode.

It didn't break in my tests. And I checked that only the callers interested in
user_enter() actually call the _track version.

Do you have a better solution?