Re: [PATCH v3 2/2] proc: add /proc/<pid>/arch_state

From: Dave Martin
Date: Fri Nov 23 2018 - 12:11:12 EST


On Thu, Nov 22, 2018 at 09:40:24AM +0800, Li, Aubrey wrote:
> On 2018/11/21 17:53, Peter Zijlstra wrote:
> > On Wed, Nov 21, 2018 at 09:19:36AM +0100, Peter Zijlstra wrote:
> >> On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote:
> >>>> Also; you were going to shop around with the other architectures to see
> >>>> what they want/need for this interface. I see nothing on that.
> >>>>
> >>> I'm open for your suggestion, :)
> >>
> >> Well, we have linux-arch and the various maintainers are also listed in
> >> MAINTAINERS. Go forth and ask..
> >
> > Ok, so I googled a wee bit (you could have too).
> >
> > There's not that many architectures that build big hot chips
> > (powerpc,x86,arm64,s390) (mips, sparc64 and ia64 are pretty dead I
> > think, although the Fujitsu Sparc M10 X+/X SIMD looked like it could be
> > 'fun').
> >
> > Of those, powerpc altivec doesn't seem to be very wide, but you'd have
> > to ask the power folks. Same for s390 z13.
> >
> > The Fujitsu/ARM64-SVE stuff looks like it can be big and hot.
> >
> > And RISC-V has a vector extension, but I don't think anybody is
> > actually building big hot versions of that just yet.
> >
> Thanks Peter. Add more maintainers here.
>
> On some x86 architectures, the tasks using simd instruction(AVX512 particularly)
> need to be dealt with specially against the tasks not using simd instruction.
> I proposed an interface to expose such CPU specific information for the user
> space tools to apply different scheduling policies.
>
> The interface can be refined to be the format as /proc/<pid>/status. Not sure
> if it's useful to any other architectures.
>
> Welcome any comments.

For SVE:

We currently monitor SVE use by trapping only. We also made an ABI
decision that a syscall throws away the task's SVE state -- this
falls out naturally from the fact that the SVE state is caller-save
for regular function calls in the AArch64 ABI.

There isn't an explicit means like VZEROUPPER for userspace to
mark the SVE state as non-live without entering the kernel today.

Currently I expose as little detail to userspace as possible regarding
how/when SVE is enabled/disabled or used.


For the /proc interface:

It would be nice to expose some information to userspace about when/
where major hardware functional units are in use, but beyond the
information already supplied by hardware perf events, it's not
obvious what should be exposed.

AFAICT, the exposed flags would be partly an arbitrary artifact of
kernel implementation details: i.e., how often and when the kernel
saves/restores the task's state may affect the pattern of observed
values in non-trivial ways.

For SVE today, a task that does a lot of syscalls may appear to be using
SVE less than a second task that does fewer syscalls but is otherwise
identical -- simply because, for now, a syscall is our only way to
detect that SVE is not in use.
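
To make that bias concrete, here is a toy model (not kernel code). It
assumes a hypothetical per-task "SVE live" flag that is set when an SVE
instruction traps and cleared whenever the task makes a syscall. The
function name, the tick granularity and the periods are all invented for
illustration.

```python
def observed_sve_fraction(total_ticks, sve_period, syscall_period):
    """Fraction of ticks at which a hypothetical 'SVE live' flag reads as set.

    Toy assumptions: the task executes one SVE instruction every
    `sve_period` ticks and one syscall every `syscall_period` ticks.
    """
    flag = False
    set_ticks = 0
    for tick in range(total_ticks):
        if tick % sve_period == 0:
            flag = True       # SVE instruction traps; state becomes live
        if flag:
            set_ticks += 1    # an observer sampling now would see "in use"
        if tick % syscall_period == 0:
            flag = False      # syscall throws away the SVE state
    return set_ticks / total_ticks


# Two tasks with identical SVE usage, differing only in syscall rate.
chatty = observed_sve_fraction(10_000, sve_period=50, syscall_period=10)
quiet = observed_sve_fraction(10_000, sve_period=50, syscall_period=1000)
print(f"syscall-heavy task: flag set {chatty:.1%} of the time")
print(f"syscall-light task: flag set {quiet:.1%} of the time")
```

Both tasks issue SVE instructions at the same rate, yet the
syscall-heavy one is observed with the flag set far less often. The gap
comes purely from how often the state gets discarded.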


This kind of issue means that userspace may struggle to make good
decisions using this data: instead it's going to rely on some kind of
tuning which may become wrong as soon as the workload, kernel version
or hardware changes.


A /proc/<pid>/ file would need to be polled (which doesn't sound great)
and would also suffer from all the usual /proc raciness.
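
For illustration only, a userspace poller might look roughly like the
sketch below. The path passed in is an assumption (the proposed
arch_state file is not an existing interface), and poll_file() and its
parameters are hypothetical names. It shows both problems: the reader
only sees whatever the value happens to be at each sample, and the task
can exit between samples.

```python
import time


def poll_file(path, samples, interval_s):
    """Read `path` up to `samples` times, sleeping `interval_s` between reads.

    Returns the list of readings; a None entry means the file vanished
    (for a /proc/<pid>/ file, typically because the task exited).
    """
    readings = []
    for _ in range(samples):
        try:
            with open(path) as f:
                readings.append(f.read().strip())
        except (FileNotFoundError, ProcessLookupError):
            # The task went away between two polls: the usual /proc race.
            readings.append(None)
            break
        # Anything that changes and changes back during this sleep is
        # invisible to the poller.
        time.sleep(interval_s)
    return readings
```

A tool built this way has to pick a polling interval, which is exactly
the kind of tuning that may stop being right when the workload or
kernel changes.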

Cheers
---Dave