Re: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()

From: Yury Norov
Date: Tue Jul 25 2023 - 01:04:42 EST


On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > > + map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
>
> But you followed the thread to get a clue why it's written in this form, right?

Yes, I did. But I don't expect everyone reading kernel code to spend time
digging up the discussions that explain why it's written this way. So it
would be good to at least add a comment.

> ...
>
> > With all that I think the implementation should look something like
> > this:
>
> I would go this way if and only if the code generation on main architectures
> with both GCC and clang is better.
>
> And maybe even some performance tests need to be provided.

For the following implementation:

void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);

	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	/* First word: clear the target bits, keep everything else. */
	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) : BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	/* Second word: keep the bits above the field, write the high part. */
	w = *++map & BITMAP_FIRST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}
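
For anyone who wants to poke at a two-word write like this outside the
kernel, below is a minimal userspace sketch, not the code that was measured.
It spells the masks out with plain shifts instead of the kernel macros, and
compares the result against a bit-by-bit reference for every start/nbits
pair. The names low_mask(), sketch_bitmap_write(), ref_bitmap_write() and
count_mismatches() are made up for illustration, and the sketch assumes
nbits <= BITS_PER_LONG.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: low n bits set, for 1 <= n <= BITS_PER_LONG. */
static unsigned long low_mask(unsigned long n)
{
	return ~0UL >> (BITS_PER_LONG - n);
}

/* Sketch of a write that may straddle two words; masks are plain shifts. */
static void sketch_bitmap_write(unsigned long *map, unsigned long value,
				unsigned long start, unsigned long nbits)
{
	unsigned long end;

	if (nbits == 0)
		return;
	assert(nbits <= BITS_PER_LONG);

	value &= low_mask(nbits);
	map += start / BITS_PER_LONG;
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	if (end < BITS_PER_LONG) {
		/* Field fits in one word: clear bits [start, end], then set. */
		*map = (*map & ~(low_mask(nbits) << start)) | (value << start);
		return;
	}

	/* First word: keep the bits below start, write the low part. */
	*map = (*map & ~(~0UL << start)) | (value << start);
	map++;
	/* Second word: keep the bits above the field, write the high part. */
	*map = (*map & (~0UL << (end + 1 - BITS_PER_LONG))) |
	       (value >> (BITS_PER_LONG - start));
}

/* Reference: set or clear one bit at a time. */
static void ref_bitmap_write(unsigned long *map, unsigned long value,
			     unsigned long start, unsigned long nbits)
{
	unsigned long i, pos;

	for (i = 0; i < nbits; i++) {
		pos = start + i;
		if ((value >> i) & 1)
			map[pos / BITS_PER_LONG] |= 1UL << (pos % BITS_PER_LONG);
		else
			map[pos / BITS_PER_LONG] &= ~(1UL << (pos % BITS_PER_LONG));
	}
}

/* Count the (start, nbits) pairs for which sketch and reference disagree. */
static unsigned long count_mismatches(unsigned long value, unsigned long fill)
{
	unsigned long a[4], b[4], start, nbits, bad = 0;
	size_t i;

	for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
		for (start = 0; start + nbits <= 3 * BITS_PER_LONG; start++) {
			for (i = 0; i < 4; i++)
				a[i] = b[i] = fill;
			sketch_bitmap_write(a, value, start, nbits);
			ref_bitmap_write(b, value, start, nbits);
			for (i = 0; i < 4; i++) {
				if (a[i] != b[i]) {
					bad++;
					break;
				}
			}
		}
	}
	return bad;
}
```

Calling count_mismatches() with a few value/fill combinations (all zeroes
over all ones, all ones over all zeroes, an alternating pattern) should
return 0 if the masks on both words are right.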

This is the bloat-o-meter output:

$ scripts/bloat-o-meter lib/test_bitmap.o.orig lib/test_bitmap.o
add/remove: 8/0 grow/shrink: 1/0 up/down: 2851/0 (2851)
Function                                     old     new   delta
test_bitmap_init                            3846    5484   +1638
test_bitmap_write_perf                         -     401    +401
bitmap_write                                   -     271    +271
my_bitmap_write                                -     248    +248
bitmap_read                                    -     229    +229
__pfx_test_bitmap_write_perf                   -      16     +16
__pfx_my_bitmap_write                          -      16     +16
__pfx_bitmap_write                             -      16     +16
__pfx_bitmap_read                              -      16     +16
Total: Before=36964, After=39815, chg +7.71%

And this is the performance test:

	for (cnt = 0; cnt < 5; cnt++) {
		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_err("bitmap_write:\t%llu\t", time);

		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				my_bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_cont("%llu\n", time);
	}

Which on x86_64/kvm with GCC gives (time in ns, lower is better):
                                            Orig    My
[    1.630731] test_bitmap: bitmap_write:   299092  252764
[    1.631584] test_bitmap: bitmap_write:   299522  252554
[    1.632429] test_bitmap: bitmap_write:   299171  258665
[    1.633280] test_bitmap: bitmap_write:   299241  252794
[    1.634133] test_bitmap: bitmap_write:   306716  252934

So, that's a ~15% difference in performance and ~8% in size (248 vs 271
bytes), in favor of my version.

I don't insist on my implementation, but I think we should experiment more
with code generation.

Thanks,
Yury