Re: [PATCH] [GIT PULL v2] x86: Workaround for NMI iret woes

From: Ingo Molnar
Date: Sun Dec 18 2011 - 04:26:10 EST



* Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> DECLARE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
> #endif
> +static inline int is_debug_stack(unsigned long addr) { return 0; }
> +static inline void inc_debug_stack_usage(void) { }
> +static inline void dec_debug_stack_usage(void) { }
> +static inline void zero_debug_stack(void) { }
> +static inline void reset_debug_stack(void) { }

Naming nit: the pattern we use for methods like this follows the
form:

#SUBSYS_#OP_#CONDITION()

For example:

atomic_inc_not_zero()

To match that pattern, the above should be something like:

debug_stack_usage_inc()
debug_stack_usage_dec()

You used the proper naming scheme for the variables btw:

> +static DEFINE_PER_CPU(unsigned long, debug_stack_addr);
> +static DEFINE_PER_CPU(int, debug_stack_usage);


[ The same applies to the other methods as well, such as
zero_debug_stack(), etc. ]
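
To spell out the full set, it would then become something like this
(illustrative only - the exact #OP names are up to you):

	debug_stack_usage_inc()
	debug_stack_usage_dec()
	debug_stack_set_zero()		/* instead of zero_debug_stack() */
	debug_stack_reset()		/* instead of reset_debug_stack() */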

This:

> +void inc_debug_stack_usage(void)
> +{
> + __get_cpu_var(debug_stack_usage)++;
> +}
> +
> +void dec_debug_stack_usage(void)
> +{
> + __get_cpu_var(debug_stack_usage)--;
> +}

... if inlined, doesn't it collapse to one or two instructions at
most? If yes then this might be worth inlining.
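
I.e. something like this in the header - just a sketch, and note
that debug_stack_usage would have to be declared there (and lose its
'static') for this to build:

	DECLARE_PER_CPU(int, debug_stack_usage);

	static inline void debug_stack_usage_inc(void)
	{
		__get_cpu_var(debug_stack_usage)++;
	}

	static inline void debug_stack_usage_dec(void)
	{
		__get_cpu_var(debug_stack_usage)--;
	}

Each of those should compile down to a single %gs-relative inc/dec
instruction.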

> +
> +int is_debug_stack(unsigned long addr)
> +{
> + return __get_cpu_var(debug_stack_usage) ||
> + (addr <= __get_cpu_var(debug_stack_addr) &&
> + addr > (__get_cpu_var(debug_stack_addr) - DEBUG_STKSZ));
> +}
> +
> +void zero_debug_stack(void)
> +{
> + load_idt((const struct desc_ptr *)&nmi_idt_descr);
> +}
> +
> +void reset_debug_stack(void)
> +{
> + load_idt((const struct desc_ptr *)&idt_descr);
> +}
> +
> #else /* CONFIG_X86_64 */
>
> DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
> @@ -1208,6 +1240,8 @@ void __cpuinit cpu_init(void)
> estacks += exception_stack_sizes[v];
> oist->ist[v] = t->x86_tss.ist[v] =
> (unsigned long)estacks;
> + if (v == DEBUG_STACK - 1)

This is one of the cases where checkpatch is wrong; best for this is:

> + if (v == DEBUG_STACK-1)


> ENTRY(nmi)
> INTR_FRAME
> PARAVIRT_ADJUST_EXCEPTION_FRAME
> - pushq_cfi $-1
> + /*
> + * We allow breakpoints in NMIs. If a breakpoint occurs, then
> + * the iretq it performs will take us out of NMI context.
> + * This means that we can have nested NMIs where the next
> + * NMI is using the top of the stack of the previous NMI. We
> + * can't let it execute because the nested NMI will corrupt the
> + * stack of the previous NMI. NMI handlers are not re-entrant
> + * anyway.
> + *
> + * To handle this case we do the following:
> + * Check a special location on the stack that contains
> + * a variable that is set when NMIs are executing.
> + * The interrupted task's stack is also checked to see if it
> + * is an NMI stack.
> + * If the variable is not set and the stack is not the NMI
> + * stack then:
> + * o Set the special variable on the stack
> + * o Copy the interrupt frame into a "saved" location on the stack
> + * o Copy the interrupt frame into a "copy" location on the stack
> + * o Continue processing the NMI
> + * If the variable is set or the previous stack is the NMI stack:
> + * o Modify the "copy" location to jump to repeat_nmi
> + * o return back to the first NMI
> + *
> + * Now on exit of the first NMI, we first clear the stack variable.
> + * The NMI stack will tell any nested NMIs at that point that it is
> + * nested. Then we pop the stack normally with iret, and if there was
> + * a nested NMI that updated the copy interrupt stack frame, a
> + * jump will be made to the repeat_nmi code that will handle the second
> + * NMI.
> + */
> +
> + /* Use %rdx as our temp variable throughout */
> + pushq_cfi %rdx
> +
> + /*
> + * Check the special variable on the stack to see if NMIs are
> + * executing.
> + */
> + cmp $1, -8(%rsp)
> + je nested_nmi
> +
> + /*
> + * Now test if the previous stack was an NMI stack.
> + * We need the double check: we check the NMI stack to handle the
> + * race where the first NMI clears the variable before returning.
> + * We check the variable because the first NMI could be in a
> + * breakpoint routine using a breakpoint stack.
> + */
> + lea 6*8(%rsp), %rdx
> + test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
> +
> +nested_nmi:
> + /*
> + * Do nothing if we interrupted the fixup in repeat_nmi.
> + * It's about to repeat the NMI handler, so we are fine
> + * with ignoring this one.
> + */
> + movq $repeat_nmi, %rdx
> + cmpq 8(%rsp), %rdx
> + ja 1f
> + movq $end_repeat_nmi, %rdx
> + cmpq 8(%rsp), %rdx
> + ja nested_nmi_out
> +
> +1:
> + /* Set up the interrupted NMI's stack to jump to repeat_nmi */
> + leaq -6*8(%rsp), %rdx
> + movq %rdx, %rsp
> + pushq $__KERNEL_DS
> + pushq %rdx
> + pushfq
> + pushq $__KERNEL_CS

These probably need CFI annotations.
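
I.e. something like this (assuming the pushq_cfi/pushfq_cfi helpers
from asm/dwarf2.h apply here):

	pushq_cfi $__KERNEL_DS
	pushq_cfi %rdx
	pushfq_cfi
	pushq_cfi $__KERNEL_CS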

> + pushq_cfi $repeat_nmi
> +
> + /* Put stack back */
> + addq $(11*8), %rsp
> +
> +nested_nmi_out:
> + popq_cfi %rdx
> +
> + /* No need to check faults here */
> + INTERRUPT_RETURN
> +
> +first_nmi:
> + /*
> + * Because nested NMIs will use the pushed location that we
> + * stored rdx, we must keep that space available.

s/stored rdx/stored in rdx

> +restart_nmi:
> + pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */
> subq $ORIG_RAX-R15, %rsp
> CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
> + /*
> + * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
> + * as we should not be calling schedule in NMI context, even with
> + * normal interrupts enabled. An NMI should not be
> + * setting NEED_RESCHED or anything that normal interrupts and
> + * exceptions might do.
> + */

Note that the IRQ return checks are needed because the NMI path can
set the irq-work TIF. Might be worth putting into the comment -
NMIs are not *entirely* passive entities.
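
Something along these lines perhaps (wording only a suggestion):

	/*
	 * ... an NMI can queue irq work though, so the irq-work TIF
	 * checks in the IRQ return path are still needed even if we
	 * never set NEED_RESCHED from NMI context.
	 */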

> + /* copy the saved stack back to copy stack */
> + .rept 5
> + pushq 4*8(%rsp)

Probably needs CFI annotation as well.
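
I.e. presumably:

	.rept 5
	pushq_cfi 4*8(%rsp)
	.endr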

> dotraplinkage notrace __kprobes void
> do_nmi(struct pt_regs *regs, long error_code)
> {
> + nmi_preprocess(regs);
> +
> nmi_enter();
>
> inc_irq_stat(__nmi_count);
> @@ -416,6 +515,8 @@ do_nmi(struct pt_regs *regs, long error_code)
> default_do_nmi(regs);
>
> nmi_exit();
> +
> + nmi_postprocess();

Small naming nit: it would be nice if the nmi_postprocess() naming
indicated the connection to the preprocess block - in particular to
the retry loop, which has the potential for an infinite loop.

Something like nmi_postprocess_retry_preprocess()?
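
That would make the retry potential visible right at the call site -
a sketch based on the quoted do_nmi() above:

	dotraplinkage notrace __kprobes void
	do_nmi(struct pt_regs *regs, long error_code)
	{
		nmi_preprocess(regs);

		nmi_enter();
		...
		nmi_exit();

		nmi_postprocess_retry_preprocess();
	}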

Looks good otherwise.

Thanks,

Ingo