Re: [PATCH v2 05/15] x86/alternatives: Use optimized NOPs for padding

From: Ingo Molnar
Date: Wed Mar 04 2015 - 01:43:30 EST



* Borislav Petkov <bp@xxxxxxxxx> wrote:

> From: Borislav Petkov <bp@xxxxxxx>
>
> Alternatives now allow for an empty old instruction. In that case we
> pad the space with NOPs at assembly time. However, there are optimal,
> longer NOPs which should be used instead. Do that at patching time by
> adding alt_instr.padlen-sized NOPs at the old instruction address.
>
> Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> Signed-off-by: Borislav Petkov <bp@xxxxxxx>
> ---
> arch/x86/kernel/alternative.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 715af37bf008..af397cc98d05 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -323,6 +323,14 @@ done:
> n_dspl, (unsigned long)orig_insn + n_dspl + repl_len);
> }
>
> +static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
> +{
> +	add_nops(instr + (a->instrlen - a->padlen), a->padlen);
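
The hunk is cut off above, so just to spell out my mental model: I assume
the new helper ends up being called from apply_alternatives() for features
the CPU doesn't have, roughly like this (my reconstruction for
illustration, not quoted from the patch):

	if (!boot_cpu_has(a->cpuid)) {
		if (a->padlen)
			optimize_nops(a, instr);	/* rewrite padding with ideal NOPs */
		continue;
	}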

So while looking at this patch I was wondering about the following
question: right now add_nops() does the obvious 'fill with large NOPs
first, then fill the remaining bytes with a smaller NOP' logic:

/* Use this to add nops to a buffer, then text_poke the whole buffer. */
static void __init_or_module add_nops(void *insns, unsigned int len)
{
	while (len > 0) {
		unsigned int noplen = len;
		if (noplen > ASM_NOP_MAX)
			noplen = ASM_NOP_MAX;
		memcpy(insns, ideal_nops[noplen], noplen);
		insns += noplen;
		len -= noplen;
	}
}
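
So, concretely (assuming ASM_NOP_MAX is 8 here, as I believe it is on
64-bit), a 9-byte pad comes out as one 8-byte NOP followed by a 1-byte NOP:

	add_nops(buf, 9);
	/* 1st iteration: memcpy(buf,     ideal_nops[8], 8);  len: 9 -> 1 */
	/* 2nd iteration: memcpy(buf + 8, ideal_nops[1], 1);  len: 1 -> 0 */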

This works perfectly fine, but I'm wondering how current decoders behave
when a large NOP crosses a cacheline boundary or a page boundary. Is
there any inefficiency in that case, and if so, could we avoid it by not
spilling NOPs across cacheline or page boundaries?

With potentially thousands of patched instructions, both situations are
bound to occur: dozens of times in the cacheline case, and a few times
in the page boundary case.

There's also the following special case of a large NOP followed by a
small NOP, where the number of NOPs would not change if we padded
differently:

[            large NOP            ][ smaller NOP ]
[        cacheline 1        ][        cacheline 2        ]

which might be more optimally filled with two mid-size NOPs:

[        midsize NOP        ][        midsize NOP        ]
[        cacheline 1        ][        cacheline 2        ]

That way no such boundary would be partially covered by a single NOP
instruction.
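
Purely to illustrate the idea (the 64-byte line size, the NOP_CL_SIZE
define and the add_nops_cl() name are made up for this sketch, it's not a
patch suggestion), the fill loop could clamp each NOP to whatever is left
in the current cacheline:

#define NOP_CL_SIZE	64	/* assumed L1I line size, illustration only */

/* Illustrative variant: never let a single NOP straddle a cacheline. */
static void __init_or_module add_nops_cl(void *insns, unsigned int len)
{
	while (len > 0) {
		unsigned int left = NOP_CL_SIZE -
				    ((unsigned long)insns & (NOP_CL_SIZE - 1));
		unsigned int noplen = len;

		if (noplen > ASM_NOP_MAX)
			noplen = ASM_NOP_MAX;
		if (noplen > left)
			noplen = left;

		memcpy(insns, ideal_nops[noplen], noplen);
		insns += noplen;
		len -= noplen;
	}
}

That wouldn't produce the nicely balanced split from the picture above,
but it would guarantee that no NOP instruction straddles a line, at the
cost of sometimes emitting one more NOP than strictly necessary.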

But the main question is, do such alignment details ever matter to
decoder performance?

Thanks,

Ingo