Re: x86 memcpy performance

From: Borislav Petkov
Date: Mon Aug 15 2011 - 12:12:17 EST

Next message: Randy Dunlap: "Re: [PATCH] drivers: base: platform.c: Fix warning on make htmldocs"
Previous message: Stephen Warren: "RE: [RFC PATCH 10/12] arm/tegra: Add device tree support to pinmuxdriver"
In reply to: Andrew Lutomirski: "Re: x86 memcpy performance"
Next in thread: Andrew Lutomirski: "Re: x86 memcpy performance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>> But irq_fpu_usable() still checks !in_interrupt(), which means
>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>> are still fine when running with CR0.TS set. So what happens when we
>> get an #NM as a result of executing an FPU instruction in an IRQ
>> handler? We will have to do init_fpu() on the current task if the
>> latter hasn't used math yet and do the slab allocation of the FPU
>> context area (I'm looking at math_state_restore, btw).
>
> IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
> an interrupt and TS=1, then we know that we're not in a
> kernel_fpu_begin section, so it's safe to start one (and do clts).

Doh, yes, I see it now. This way we save the math state of the current
process if needed and "disable" #NM exceptions until kernel_fpu_end() by
clearing CR0.TS, sure. Thanks.
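
FWIW, the reasoning above can be written down as a toy userspace model.
This is only a sketch and not kernel code: every toy_* name is made up
for illustration, and it models just what is said in this thread (TS=1
in IRQ context means nobody is inside a begin/end section). The real
irq_fpu_usable() may well check more than that.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All names are hypothetical. */
struct toy_cpu {
	bool cr0_ts;           /* models the CR0.TS bit */
	bool in_interrupt;     /* models in_interrupt() */
	bool task_owns_fpu;    /* current task has live math state in the regs */
	bool task_state_saved; /* that state has been saved to memory */
};

/*
 * Outside IRQ context the FPU is usable. In IRQ context, TS=1 tells us
 * that nobody is inside a begin/end section, so starting one is safe.
 */
static bool toy_irq_fpu_usable(const struct toy_cpu *c)
{
	return !c->in_interrupt || c->cr0_ts;
}

/* Save the task's math state if needed, then clear TS so no #NM fires. */
static void toy_kernel_fpu_begin(struct toy_cpu *c)
{
	assert(toy_irq_fpu_usable(c));
	if (c->task_owns_fpu) {
		c->task_state_saved = true;
		c->task_owns_fpu = false;
	}
	c->cr0_ts = false;	/* models clts */
}

/* Set TS again so the next FPU use by a task traps with #NM. */
static void toy_kernel_fpu_end(struct toy_cpu *c)
{
	c->cr0_ts = true;	/* models stts */
}
```

With TS=1 in an interrupt the model lets a section start; a nested
interrupt arriving inside that section sees TS=0 and is refused.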

> IMO this code is not very good, and I plan to fix it sooner or later.

Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
You could probably reuse some bits from there. The patchset should be in
tip/x86/xsave.

> I want kernel_fpu_begin (or its equivalent*) to be very fast and
> usable from any context whatsoever. Mucking with TS is slower than a
> complete save and restore of YMM state.

Well, I had an SSE memcpy which saved/restored the XMM regs on the stack.
This would obviate the need to muck with contexts but that could get
expensive wrt stack operations. The advantage is that I'm not dealing
with the whole FPU state but only with 16 XMM regs. I should probably
dust off that version again and retest.
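
To make the idea concrete, here is a minimal userspace sketch of the
copy loop only, using SSE2 intrinsics. It is not the version I had:
sse_copy is a made-up name, and in userspace the compiler and ABI take
care of the XMM registers, so nothing is saved or restored here. A
kernel variant would have to save the XMM regs it clobbers before the
loop and restore them afterwards.

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Userspace sketch only; sse_copy is a hypothetical name. */
static void *sse_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(len == 0 || (dst != NULL && src != NULL));

	/* Move 16 bytes per iteration through one XMM register. */
	while (len >= 16) {
		/* Unaligned load/store, so no alignment requirement. */
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_storeu_si128((__m128i *)d, v);
		s += 16;
		d += 16;
		len -= 16;
	}

	/* Copy the remaining tail of fewer than 16 bytes. */
	if (len)
		memcpy(d, s, len);

	return dst;
}
```

A tuned version would unroll the loop over several XMM regs and align
the destination first, which is exactly where the number of registers
to save starts to matter.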

Or, if we want to use SSE stuff in the kernel, we might think of
allocating dedicated FPU context(s) for the kernel and handling those...

> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.

Well, do we want to use floating point instructions in the kernel?

Thanks.

--
Regards/Gruss,
Boris.
