Re: [PATCH] arm64: enable GENERIC_FIND_FIRST_BIT

From: Alexander Lobakin
Date: Thu Feb 25 2021 - 06:54:59 EST


From: Yury Norov <yury.norov@xxxxxxxxx>
Date: Wed, 24 Feb 2021 07:44:16 -0800

> On Wed, Feb 24, 2021 at 11:52:55AM +0000, Alexander Lobakin wrote:
> > From: Yury Norov <yury.norov@xxxxxxxxx>
> > Date: Sat, 5 Dec 2020 08:54:06 -0800
> >
> > Hi,
> >
> > > ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
> > > enable it in config. It leads to using find_next_bit() which is less
> > > efficient:
> > >
> > > 0000000000000000 <find_first_bit>:
> > > 0: aa0003e4 mov x4, x0
> > > 4: aa0103e0 mov x0, x1
> > > 8: b4000181 cbz x1, 38 <find_first_bit+0x38>
> > > c: f9400083 ldr x3, [x4]
> > > 10: d2800802 mov x2, #0x40 // #64
> > > 14: 91002084 add x4, x4, #0x8
> > > 18: b40000c3 cbz x3, 30 <find_first_bit+0x30>
> > > 1c: 14000008 b 3c <find_first_bit+0x3c>
> > > 20: f8408483 ldr x3, [x4], #8
> > > 24: 91010045 add x5, x2, #0x40
> > > 28: b50000c3 cbnz x3, 40 <find_first_bit+0x40>
> > > 2c: aa0503e2 mov x2, x5
> > > 30: eb02001f cmp x0, x2
> > > 34: 54ffff68 b.hi 20 <find_first_bit+0x20> // b.pmore
> > > 38: d65f03c0 ret
> > > 3c: d2800002 mov x2, #0x0 // #0
> > > 40: dac00063 rbit x3, x3
> > > 44: dac01063 clz x3, x3
> > > 48: 8b020062 add x2, x3, x2
> > > 4c: eb02001f cmp x0, x2
> > > 50: 9a829000 csel x0, x0, x2, ls // ls = plast
> > > 54: d65f03c0 ret
> > >
> > > ...
> > >
> > > 0000000000000118 <_find_next_bit.constprop.1>:
> > > 118: eb02007f cmp x3, x2
> > > 11c: 540002e2 b.cs 178 <_find_next_bit.constprop.1+0x60> // b.hs, b.nlast
> > > 120: d346fc66 lsr x6, x3, #6
> > > 124: f8667805 ldr x5, [x0, x6, lsl #3]
> > > 128: b4000061 cbz x1, 134 <_find_next_bit.constprop.1+0x1c>
> > > 12c: f8667826 ldr x6, [x1, x6, lsl #3]
> > > 130: 8a0600a5 and x5, x5, x6
> > > 134: ca0400a6 eor x6, x5, x4
> > > 138: 92800005 mov x5, #0xffffffffffffffff // #-1
> > > 13c: 9ac320a5 lsl x5, x5, x3
> > > 140: 927ae463 and x3, x3, #0xffffffffffffffc0
> > > 144: ea0600a5 ands x5, x5, x6
> > > 148: 54000120 b.eq 16c <_find_next_bit.constprop.1+0x54> // b.none
> > > 14c: 1400000e b 184 <_find_next_bit.constprop.1+0x6c>
> > > 150: d346fc66 lsr x6, x3, #6
> > > 154: f8667805 ldr x5, [x0, x6, lsl #3]
> > > 158: b4000061 cbz x1, 164 <_find_next_bit.constprop.1+0x4c>
> > > 15c: f8667826 ldr x6, [x1, x6, lsl #3]
> > > 160: 8a0600a5 and x5, x5, x6
> > > 164: eb05009f cmp x4, x5
> > > 168: 540000c1 b.ne 180 <_find_next_bit.constprop.1+0x68> // b.any
> > > 16c: 91010063 add x3, x3, #0x40
> > > 170: eb03005f cmp x2, x3
> > > 174: 54fffee8 b.hi 150 <_find_next_bit.constprop.1+0x38> // b.pmore
> > > 178: aa0203e0 mov x0, x2
> > > 17c: d65f03c0 ret
> > > 180: ca050085 eor x5, x4, x5
> > > 184: dac000a5 rbit x5, x5
> > > 188: dac010a5 clz x5, x5
> > > 18c: 8b0300a3 add x3, x5, x3
> > > 190: eb03005f cmp x2, x3
> > > 194: 9a839042 csel x2, x2, x3, ls // ls = plast
> > > 198: aa0203e0 mov x0, x2
> > > 19c: d65f03c0 ret
> > >
> > > ...
> > >
> > > 0000000000000238 <find_next_bit>:
> > > 238: a9bf7bfd stp x29, x30, [sp, #-16]!
> > > 23c: aa0203e3 mov x3, x2
> > > 240: d2800004 mov x4, #0x0 // #0
> > > 244: aa0103e2 mov x2, x1
> > > 248: 910003fd mov x29, sp
> > > 24c: d2800001 mov x1, #0x0 // #0
> > > 250: 97ffffb2 bl 118 <_find_next_bit.constprop.1>
> > > 254: a8c17bfd ldp x29, x30, [sp], #16
> > > 258: d65f03c0 ret
> > >
> > > Enabling this functions would also benefit for_each_{set,clear}_bit().
> > > Would it make sense to enable this config for all such architectures by
> > > default?
> >
> > I confirm that GENERIC_FIND_FIRST_BIT also produces more optimized and
> > fast code on MIPS (32 R2) where there is also no architecture-specific
> > bitsearching routines.
> > So, if it's okay for other folks, I'd suggest to go for it and enable
> > for all similar arches.
>
> As far as I understand the idea of GENERIC_FIND_FIRST_BIT=n, it's
> intended to save some space in .text. But in fact it bloats the
> kernel:
>
> yury:linux$ scripts/bloat-o-meter vmlinux vmlinux.ffb
> add/remove: 4/1 grow/shrink: 19/251 up/down: 564/-1692 (-1128)
> ...

Same for MIPS, enabling GENERIC_FIND_FIRST_BIT saves a bunch of .text
memory despite that it introduces new entries.

> For the next cycle, I'm going to submit a patch that removes the
> GENERIC_FIND_FIRST_BIT completely and forces all architectures to
> use find_first{_zero}_bit()

I like that idea. I'm almost sure there'll be no arch that benefits
from CONFIG_GENERIC_FIND_FIRST_BIT=n (and has no arch-optimized
versions).

> > (otherwise, I'll publish a separate entry for mips-next after 5.12-rc1
> > release and mention you in "Suggested-by:")
>
> I think it worth to enable GENERIC_FIND_FIRST_BIT for mips and arm now
> and see how it works for people. If there'll be no complains I'll remove
> the config entirely. I'm OK if you submit the patch for mips now, or we
> can make a series and submit together. Works either way.

Lez make a series and see how it goes. I'll send you MIPS part soon.

Al