RE: [PATCH] objtool,x86: Teach decode about LOOP* instructions

From: David Laight
Date: Wed Sep 07 2022 - 07:14:06 EST


From: Peter Zijlstra
> Sent: 07 September 2022 10:40
>
> On Wed, Sep 07, 2022 at 09:06:12AM +0000, David Laight wrote:
> > From: Peter Zijlstra
> > > Sent: 07 September 2022 10:01
> > >
> > > On Wed, Sep 07, 2022 at 09:06:45AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 07, 2022 at 09:55:21AM +0900, Masami Hiramatsu (Google) wrote:
> > > >
> > > > > +/* Return the jump target address or 0 */
> > > > > +static inline unsigned long insn_get_branch_addr(struct insn *insn)
> > > > > +{
> > > > > + switch (insn->opcode.bytes[0]) {
> > > > > + case 0xe0: /* loopne */
> > > > > + case 0xe1: /* loope */
> > > > > + case 0xe2: /* loop */
> > > >
> > > > Oh cute, objtool doesn't know about those, let me go add them.
> >
> > Do they ever appear in the kernel?
>
> No; that is, not on any of the random vmlinux.o images I checked this
> morning.
>
> Still, best to properly decode them anyway.
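
For reference, the three opcodes in the quoted hunk are two-byte instructions: the opcode byte followed by a signed 8-bit displacement, taken relative to the address of the next instruction. A minimal sketch of that target calculation follows. The helper name loop_branch_target() and its raw-byte interface are hypothetical (this is not the insn_get_branch_addr() from the patch, which works on struct insn), and it assumes no prefix bytes precede the opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's insn_get_branch_addr().
 * 'code' points at the opcode byte and 'addr' is the address of that byte.
 * Assumes no prefix bytes (e.g. 0x67) precede the opcode.
 * Returns the branch target, or 0 if the opcode is not LOOPNE/LOOPE/LOOP.
 */
static uint64_t loop_branch_target(const uint8_t *code, uint64_t addr)
{
	assert(code != NULL);

	switch (code[0]) {
	case 0xe0: /* loopne rel8 */
	case 0xe1: /* loope rel8 */
	case 0xe2: /* loop rel8 */
		/*
		 * Opcode + rel8 is two bytes long; the displacement is
		 * sign-extended and added to the next instruction's address.
		 */
		return addr + 2 + (uint64_t)(int64_t)(int8_t)code[1];
	default:
		return 0;
	}
}
```

Returning 0 for "not a LOOP* instruction" mirrors the comment in the quoted patch; a real decoder would also have to account for prefixes when computing the instruction length.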

It is annoying that CPUs with adox/adcx have a slow 'loop' instruction.
You really want to be able to do:
1:	adox ...
	adcx ...
	loop 1b
That would never run at one iteration/clock.
But unrolling once would probably be enough.
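
The reason adox/adcx are attractive here is that adcx carries through CF and adox carries through OF, so two add-with-carry chains can be interleaved without disturbing each other. The following is only a portable C model of that idea, not the assembly itself: the names add_carry() and csum64_two_chains() are made up for illustration, and the two 'unsigned' carry variables stand in for the two flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Add b and the incoming carry to a; store the carry out in *carry. */
static uint64_t add_carry(uint64_t a, uint64_t b, unsigned *carry)
{
	uint64_t s = a + b;
	unsigned c = s < a;

	s += *carry;
	c |= s < *carry;
	*carry = c;
	return s;
}

/*
 * Ones' complement sum of n 64-bit words using two independent
 * accumulators, each with its own carry (standing in for CF and OF).
 */
static uint64_t csum64_two_chains(const uint64_t *p, size_t n)
{
	uint64_t sum_c = 0, sum_o = 0;
	unsigned cf = 0, of = 0;
	size_t i;

	for (i = 0; i + 1 < n; i += 2) {
		sum_c = add_carry(sum_c, p[i], &cf);     /* models the adcx chain */
		sum_o = add_carry(sum_o, p[i + 1], &of); /* models the adox chain */
	}
	if (i < n)
		sum_c = add_carry(sum_c, p[i], &cf);

	/* Merge the two chains, then add back every carry still pending. */
	unsigned pending = cf + of;
	unsigned c = 0;
	uint64_t s = add_carry(sum_c, sum_o, &c);

	pending += c;
	while (pending) {
		c = 0;
		s = add_carry(s, pending, &c);
		pending = c;
	}
	return s;
}
```

In the model the two chains only meet after the loop, which is what lets a CPU overlap them; whether real hardware sustains that depends on the loop-control overhead discussed here.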

What you can do (and what gives the fastest IP csum loop) is:
1:	jcxz 2f
	....
	lea %rcx,...
	jmp 1b
2:
The extra instructions mean that it needs unrolling 4 times.
I've got over 12 bytes/clock that way.
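
The point of that shape is that jcxz, lea and jmp leave the arithmetic flags alone, so the carry chain survives from one pass to the next. A rough C rendering of the control flow is sketched below. It assumes, as one common arrangement and not something stated above, that the counter is a negative index that counts up to zero; the names add_carry() and csum64_unrolled() are hypothetical, and the word count must be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add b and the incoming carry to a; store the carry out in *carry. */
static uint64_t add_carry(uint64_t a, uint64_t b, unsigned *carry)
{
	uint64_t s = a + b;
	unsigned c = s < a;

	s += *carry;
	c |= s < *carry;
	*carry = c;
	return s;
}

/*
 * Ones' complement sum of n 64-bit words, 4 words per pass.
 * Assumption: the counter is a negative index running up to zero.
 */
static uint64_t csum64_unrolled(const uint64_t *p, size_t n)
{
	const uint64_t *end = p + n;
	ptrdiff_t i = -(ptrdiff_t)n;
	uint64_t sum = 0;
	unsigned carry = 0;

	assert(n % 4 == 0);

	for (;;) {
		if (i == 0)	/* 1: jcxz 2f */
			break;
		sum = add_carry(sum, end[i + 0], &carry);
		sum = add_carry(sum, end[i + 1], &carry);
		sum = add_carry(sum, end[i + 2], &carry);
		sum = add_carry(sum, end[i + 3], &carry);
		i += 4;		/* lea: counter update, carry untouched */
	}			/* jmp 1b */

	/* 2: fold the last carry back in (end-around carry). */
	while (carry)
		sum = add_carry(sum, 0, &carry);
	return sum;
}
```

The C compiler is free to restructure this loop, so the sketch shows the intended control flow only, not the generated code or its throughput.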

David
