Re: [PATCH] riscv: Optimize memset

From: Andrew Jones
Date: Tue May 09 2023 - 05:16:41 EST


On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> From: zhangfei <zhangfei@xxxxxxxxxxxxxx>
>
> > > 5:
> > > - sb a1, 0(t0)
> > > - addi t0, t0, 1
> > > - bltu t0, a3, 5b
> > > + sb a1, 0(t0)
> > > + sb a1, -1(a3)
> > > + li a4, 2
> > > + bgeu a4, a2, 6f
> > > +
> > > + sb a1, 1(t0)
> > > + sb a1, 2(t0)
> > > + sb a1, -2(a3)
> > > + sb a1, -3(a3)
> > > + li a4, 6
> > > + bgeu a4, a2, 6f
> > > +
> > > + sb a1, 3(t0)
> > > + sb a1, -4(a3)
> > > + li a4, 8
> > > + bgeu a4, a2, 6f
> >
> > Why is this check here?
>
> Hi,
>
> I filled head and tail with minimal branching. Each conditional ensures that
> all the subsequently used offsets are well-defined and in the dest region.

I know. You trimmed my comment, so I'll quote myself here:

"""
After the check of a2 against 6 above we know that offsets 6(t0)
and -7(a3) are safe. Are we trying to avoid too many redundant
stores with these additional checks?
"""

So, again: why the additional check against 8 above, and the check
against 10 that you trimmed?
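
To make the bounds argument concrete, here is a rough C rendering of the
head/tail stores in the quoted hunk. It is only a sketch, not code from the
patch: the names fill_head_tail and check_len are made up, d stands in for
t0, d + n for a3, and n for a2. It only handles 1 <= n <= 8 and leaves
larger sizes to the rest of the routine.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical C rendering of the head/tail stores in the quoted hunk.
 * d plays the role of t0, d + n the role of a3, and n the role of a2.
 * Only valid for 1 <= n <= 8; larger n needs the rest of the routine.
 */
static void fill_head_tail(unsigned char *d, unsigned char c, size_t n)
{
	d[0] = c;		/* sb a1, 0(t0) */
	d[n - 1] = c;		/* sb a1, -1(a3) */
	if (n <= 2)		/* li a4, 2; bgeu a4, a2, 6f */
		return;

	/* n >= 3: offsets 0..2 and n-3..n-1 are in bounds */
	d[1] = c;
	d[2] = c;
	d[n - 2] = c;
	d[n - 3] = c;
	if (n <= 6)		/* li a4, 6; bgeu a4, a2, 6f */
		return;

	/* n >= 7: offsets 0..6 and n-7..n-1 are in bounds */
	d[3] = c;
	d[n - 4] = c;
	/* for n == 7 and n == 8 every byte has now been written */
}

/* Fill n bytes inside a guarded buffer and verify exactly n were set. */
static void check_len(size_t n)
{
	unsigned char buf[10] = { 0 };
	size_t i;

	fill_head_tail(buf + 1, 0xAB, n);
	assert(buf[0] == 0);		/* guard byte before the region */
	for (i = 0; i < n; i++)
		assert(buf[1 + i] == 0xAB);
	assert(buf[1 + n] == 0);	/* guard byte after the region */
}
```

Calling check_len() for each n from 1 to 8 exercises every path through the
sketch. The comment after the check against 6 is the point of my question:
from there on n >= 7, so the stores that follow are already in bounds.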

>
> Although this approach may result in redundant stores, compared to
> byte-by-byte stores it allows the store instructions to be executed in
> parallel and reduces the number of jumps.

I understood that when I read the code, but text like this should go in
the commit message to avoid people having to think their way through
stuff.

>
> I used the code linked below for performance testing and commented out the
> memset calls specific to the Arm architecture in the code to ensure it runs
> properly on the RISC-V platform.
>
> [1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
>
> The test platform is a RISC-V SiFive U74. The test data is as follows:
>
> Before optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
>
> Medium memset (bytes/ns):
> memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
>
> After optimization
> ---------------------
> Random memset (bytes/ns):
> memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
> Medium memset (bytes/ns):
> memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
> memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
>
> The results show that memset performance improves significantly for sizes
> of around 8B, from 0.18 bytes/ns to 0.27 bytes/ns.

And these benchmark results belong in the cover letter, which this series
is missing.

Thanks,
drew