Re: [PATCH 0/3] ARM ZSTD boot compression

From: Nick Terrell
Date: Fri Oct 20 2023 - 14:53:54 EST

> On Oct 12, 2023, at 6:27 PM, J. Neuschäfer <j.neuschaefer@xxxxxxx> wrote:
>
> On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
>>> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@xxxxxxx> wrote:
>>> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>>>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>>>
>>>>>> - LZO: 7.2 MiB, 6 seconds
>>>>>> - ZSTD: 5.6 MiB, 60 seconds
> [...]
>>> For ZSTD as used in kernel decompression (the zstd22 configuration), the
>>> window is even bigger, 128 MiB. (AFAIU)
>>
>> Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...
>>
>> But this is totally configurable. You can switch compression configurations
>> at any time. If you believe that the window size is the issue causing speed
>> regressions, you could configure zstd compression to use e.g. a 256 KiB window
>> size like this:
>>
>> zstd -19 --zstd=wlog=18
>>
>> This will keep the same algorithm search strength, but limit the decoder memory
>> usage.
>
> Noted.
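
As a quick sanity check on the numbers above: the zstd window size is 2^wlog
bytes, so wlog=18 gives 256 KiB, and a 128 MiB window corresponds to wlog=27.
A minimal shell sketch of that arithmetic (the helper name window_bytes is
made up for illustration, it is not part of zstd):

```shell
#!/bin/sh
# window_bytes WLOG: print the window size in bytes implied by a zstd
# window log. Hypothetical helper for illustration only.
window_bytes() {
    echo "$(( 1 << $1 ))"
}

# wlog=18 -> 262144 bytes = 256 KiB
echo "wlog=18: $(( $(window_bytes 18) / 1024 )) KiB"
# wlog=27 -> 134217728 bytes = 128 MiB
echo "wlog=27: $(( $(window_bytes 27) / 1024 / 1024 )) MiB"
```
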
>
>> I will also try to get this patchset working on my machine, and try to debug.
>> The 10x slower speed difference is not expected, and we see much better speed
>> in userspace ARM. I suspect it has something to do with the preboot environment.
>> E.g. when implementing x86-64 zstd kernel decompression, I noticed that
>> memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
>> penalty.
>
> In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
> only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
> think the main culprit here was particularly bad luck in my choice of
> test hardware.
>
> The inlining issues are a good point, noted for the next time I work on this.

I went out and bought a Raspberry Pi 4 to test on. I’ve done some crude measurements
and see that zstd kernel decompression is just slightly slower than gzip kernel
decompression, and about 2x slower than lzo. In userspace decompression of the same
file (a manually compressed kernel image), I see that zstd decompression is significantly
faster than gzip. So it is definitely something about the preboot environment, or how
the code is compiled for the preboot environment, that is causing the issue.
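
For anyone who wants to repeat the userspace side of that comparison, a rough
sketch follows. It assumes the kernel image has already been compressed by hand
into Image.gz and Image.zst (hypothetical file names), and bench_decompress is
a made-up helper. It only illustrates the idea and is not the exact procedure
behind the numbers above; the timing has whole-second resolution, so it is
just as crude.

```shell
#!/bin/sh
# bench_decompress TOOL FILE: decompress FILE to /dev/null with TOOL
# (e.g. gzip or zstd, both of which accept -d -c) and print the elapsed
# wall-clock time in whole seconds. Hypothetical helper for illustration.
bench_decompress() {
    tool=$1
    file=$2
    # Bail out early if the tool is not installed.
    command -v "$tool" >/dev/null 2>&1 || {
        echo "$tool: not installed" >&2
        return 1
    }
    start=$(date +%s)
    "$tool" -dc "$file" >/dev/null || return 1
    end=$(date +%s)
    echo "$tool: $(( end - start )) s"
}

# Example usage (file names are assumptions):
#   bench_decompress gzip Image.gz
#   bench_decompress zstd Image.zst
```
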

My next step is to set up qemu on my Pi to try to get some perf measurements of the
decompression. One thing I’ve really been struggling with, and what thwarted my last
attempts at adding ARM zstd kernel decompression, was getting preboot logs printed.

I’ve figured out I need CONFIG_DEBUG_LL=y, but I’ve yet to actually get any logs.
And I can’t figure out how to get it working in qemu. I haven’t tried qemu on an ARM
host with kvm, but that’s the next thing I will try.

Do you happen to have any advice about how to get preboot logs in qemu? Is it
possible only on an ARM host, or would it also be possible on an x86-64 host?

Thanks,
Nick Terrell

> Thanks,
> Jonathan