RE: [PATCH v9 00/27] x86: load FPU registers on return to userland

From: David Laight
Date: Thu Apr 04 2019 - 11:09:48 EST


From: Sebastian Andrzej Siewior
> From: Andy Lutomirski [mailto:luto@xxxxxxxxxx]
> Sent: 04 April 2019 15:27
>
> On Thu, Apr 4, 2019 at 7:14 AM Sebastian Andrzej Siewior
> <bigeasy@xxxxxxxxxxxxx> wrote:
> >
> > On 2019-04-04 14:01:43 [+0000], David Laight wrote:
> > > From: Sebastian Andrzej Siewior
> > > > Sent: 03 April 2019 17:41
> > > ...
> > > > To access the FPU registers in kernel we need:
> > > > - disable preemption to avoid that the scheduler switches tasks. By
> > > > doing so it would set TIF_NEED_FPU_LOAD and the FPU registers would be
> > > > not valid.
> > > > - disable BH because the softirq might use kernel_fpu_begin() and then
> > > > set TIF_NEED_FPU_LOAD instead loading the FPU registers on completion.
> > >
> > > Is there a possible optimisation here for kernel threads?
> > > Since there is no 'user FP state' the 'kernel FP state' can
> > > be saved by a task switch or softirq.
> >
> > There is no such thing as "kernel FP state" that is saved.
> >
>
> I think that David was asking whether we could make kernel_fpu_begin()
> regions sometimes be preemptible. The answer is presumably yes, but I
> think that should be a separate effort, and it should be justified
> with improved performance above and beyond what we get with Jason's
> simd_get() stuff.

Yep...

- Actually there are some loops that process more or less arbitrary
- amounts of data. But all of those somewhat manually break that up
- into page size chunks before checking if we've disabled preemption
- for too long, and then if so, do an end()/begin() sequence before
- starting the next chunk. My simd_relax takes care of that, for
- example. Given that a long term purpose of this patchset is to
- obsolete the simd_get/put/relax API I've proposed, it seems like
- it might be nice to also do away with the manual relaxation
- requirement, if that's somehow possible.

The advantage of a 'relax' (or kernel_fpu_end()/begin() pair)
is that the kernel registers never need saving.
OTOH allowing arbitrary context switches is (probably) better
for latency.
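
For illustration only, the chunk-and-relax pattern discussed above can be
sketched as a userspace mock. Everything named mock_* is hypothetical and
merely stands in for kernel_fpu_begin()/kernel_fpu_end(); CHUNK_SIZE stands
in for the page size, and a byte-wise XOR stands in for the SIMD work. The
quoted description checks whether preemption has been disabled for too long
before relaxing; this sketch skips that check and relaxes after every chunk.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 4096	/* assumption: stand-in for the page size */

/* Mock state standing in for the real FPU begin/end bookkeeping. */
static int mock_begin_calls;
static int mock_end_calls;
static int mock_fpu_active;

static void mock_fpu_begin(void)
{
	mock_fpu_active = 1;
	mock_begin_calls++;
}

static void mock_fpu_end(void)
{
	mock_fpu_active = 0;
	mock_end_calls++;
}

/* Hypothetical 'relax': leave and re-enter the FPU section between chunks. */
static void mock_fpu_relax(void)
{
	mock_fpu_end();
	mock_fpu_begin();
}

/* Byte-wise XOR stands in for whatever the SIMD code would compute. */
static unsigned char process_chunk(const unsigned char *p, size_t len)
{
	unsigned char acc = 0;
	size_t i;

	for (i = 0; i < len; i++)
		acc ^= p[i];
	return acc;
}

/* Process an arbitrary amount of data in CHUNK_SIZE pieces. */
unsigned char process_buffer(const unsigned char *buf, size_t len)
{
	unsigned char acc = 0;

	assert(buf != NULL || len == 0);

	mock_fpu_begin();
	while (len > 0) {
		size_t n = len < CHUNK_SIZE ? len : CHUNK_SIZE;

		acc ^= process_chunk(buf, n);
		buf += n;
		len -= n;

		/* Only relax if another chunk follows. */
		if (len > 0)
			mock_fpu_relax();
	}
	mock_fpu_end();
	return acc;
}
```

Because the registers hold nothing live at each relax point, nothing has to
be saved there; the cost is that the caller must split the work by hand.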

As well as kernel threads being able to use the task's 'normal' FPU
save area, it ought to be possible to pass an 'FPU save area'
to kernel_fpu_begin() so that a non-kernel thread can be pre-empted
while using the FPU.

Actually you could (probably) just pass the address of a location
where the address of a kmalloc()ed save area would be written,
were one needed, and allocate the structure the first time it
is needed (although kmalloc() in that code path might be hard).
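
A rough userspace sketch of that lazy allocation, assuming nothing about
the real kernel interfaces: struct mock_fpu_save_area, mock_get_save_area()
and mock_put_save_area() are hypothetical names, the 512-byte layout is
arbitrary, and calloc() stands in for kmalloc(). It does not address
whether allocating in that code path is actually feasible.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a save area; the real layout is not modelled. */
struct mock_fpu_save_area {
	unsigned char regs[512];
};

static int mock_alloc_calls;

/*
 * Hypothetical: would be called only when the in-kernel FPU user is about
 * to be pre-empted and its registers need somewhere to go. 'slot' is the
 * location whose address the caller handed over when it started using the
 * FPU. The area is allocated on first use and reused afterwards.
 * Returns NULL if the allocation fails.
 */
struct mock_fpu_save_area *mock_get_save_area(struct mock_fpu_save_area **slot)
{
	assert(slot != NULL);

	if (*slot == NULL) {
		*slot = calloc(1, sizeof(**slot));
		if (*slot != NULL)
			mock_alloc_calls++;
	}
	return *slot;
}

/* Hypothetical: release the area (if any was ever allocated). */
void mock_put_save_area(struct mock_fpu_save_area **slot)
{
	assert(slot != NULL);

	free(*slot);
	*slot = NULL;
}
```

A caller that is never pre-empted never triggers the allocation; its slot
simply stays NULL.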

David
