RE: [PATCH 1/4] bitops: Add single_bit_set()

From: David Laight
Date: Tue Nov 23 2021 - 09:36:43 EST


From: 'Andy Shevchenko'
> Sent: 23 November 2021 13:43
>
> On Tue, Nov 23, 2021 at 10:58:44AM +0000, David Laight wrote:
> > From: Andy Shevchenko
> > > On Tue, Nov 23, 2021 at 10:42:45AM +0000, David Laight wrote:
> > > > From: Vaittinen, Matti
> > > > > Sent: 22 November 2021 13:19
> > > > > On 11/22/21 14:57, Andy Shevchenko wrote:
> > > > > > On Mon, Nov 22, 2021 at 12:42:21PM +0000, Vaittinen, Matti wrote:
> > > > > >> On 11/22/21 13:28, Andy Shevchenko wrote:
> > > > > >>> On Mon, Nov 22, 2021 at 01:03:25PM +0200, Matti Vaittinen wrote:
> > > > > >
> > > > > > What do you mean by this?
> > > > > >
> > > > > > hweight() will return you the number of the non-zero elements in the set.
> > > > >
> > > > > Exactly. The function I added did only check if given set of bits had
> > > > > only one bit set.
> > > >
> > > > Checking for exactly one bit can use the (x & (x - 1)) check on
> > > > non-zero values - which may even be better on some cpus with a
> > > > popcnt instruction.
> > >
> > > In the discussed case the value pretty much can be 0, meaning you have
> > > to add an additional test which I believe diminishes all efforts for
> > > the is_power_of_2() call.
> >
> > I wouldn't have thought so.
> > Code would be:
> > if (!scan_for_non_zero())
> > return 0;
> > if (!is_power_of_2())
> > return 0;
> > return scan_for_non_zero() ? 0 : 1;
> >
> > Hand-crafting asm you'd actually check for (x - 1) generating
> > carry in the initial scan.
>
> Have you done any benchmarks? Can we see them?
>
> > The latency of popcnt is worse than arithmetic on a lot of x86 cpus.

Well, on AMD piledriver and bulldozer (etc) 64bit popcnt has a latency of 4.
On bobcat the latency is 12.
Excavator and Ryzen are better.
Intel are ok except for the Atoms (silvermont/goldmont).
That isn't going to help.

But run on a cpu without a popcnt instruction and the performance will
really be horrid.
At best the gain for using popcnt is marginal.
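
For a single word the two approaches being compared look roughly like
this in C. This is only an illustrative sketch: the function names are
made up, and __builtin_popcountll() assumes gcc or clang (the kernel
itself would go through hweight64()).

```c
#include <stdbool.h>
#include <stdint.h>

/* Arithmetic form: clearing the lowest set bit of a non-zero
 * value leaves zero only when that was the sole bit set. */
static bool one_bit_arith(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Population-count form. Assumes the gcc/clang builtin; without a
 * hardware popcnt the compiler falls back to a software routine. */
static bool one_bit_popcnt(uint64_t x)
{
	return __builtin_popcountll(x) == 1;
}
```

Both return the same answer for every input; the only difference is
which instructions end up being executed.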

If you want to try a benchmark then code up (and debug):
        %rsi = buf + length     // pointer to end of bitmap
        %rcx = -length          // in bytes
1:      jrcxz   8f              // jumps if all zeros
        mov     (%rsi, %rcx), %rax
        mov     %rax, %rdx
        sub     $1, %rax
        lea     8(%rcx), %rcx
        jc      1b              // jump if zero word
        and     %rdx, %rax
        jnz     8f              // jump if >1 bit set
2:      jrcxz   9f              // %rax is zero here
        cmp     (%rsi, %rcx), %rax
        lea     8(%rcx), %rcx
        jz      2b
8:      xor     %eax, %eax
        ret
9:      inc     %eax            // %eax was zero, so return 1
        ret

I think that is (about) right.
The initial loop may be 3 clocks per iteration on a recent Intel cpu.
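
For comparison, the same three phases as the asm above (skip zero
words, check the first non-zero word, require the rest to be zero)
can be sketched in C. The name bitmap_single_bit() and the array of
64-bit words are assumptions made for illustration, and this has not
been checked against what a compiler actually generates.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if exactly one bit is set in words[0..n). */
static bool bitmap_single_bit(const uint64_t *words, size_t n)
{
	size_t i = 0;

	/* Phase 1: skip leading zero words. */
	while (i < n && words[i] == 0)
		i++;
	if (i == n)
		return false;	/* all zero (or empty) */

	/* Phase 2: the first non-zero word must have only one bit set. */
	if (words[i] & (words[i] - 1))
		return false;

	/* Phase 3: every remaining word must be zero. */
	for (i++; i < n; i++)
		if (words[i])
			return false;

	return true;
}
```

Unlike the asm, this tests for zero and for (x & (x - 1)) separately
rather than reusing the carry from the subtract.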

But I suspect the only real gains are on cpus without popcnt.
It isn't as though you'll be doing this as often as (say)
the IP checksum function - which I have benchmarked.

David
