Re: [PATCH 0/2] Add support for ZSTD-compressed kernel

From: Adam Borowski
Date: Tue Oct 10 2017 - 23:13:08 EST


On Wed, Oct 11, 2017 at 02:01:41AM +0000, Nick Terrell wrote:
> On 10/10/17, 5:08 PM, "Adam Borowski" <kilobyte@xxxxxxxxxx> wrote:
> > On Tue, Oct 10, 2017 at 10:40:13PM +0000, Nick Terrell wrote:
> > > On 10/10/17, 2:56 PM, "hpa@xxxxxxxxx" <hpa@xxxxxxxxx> wrote:
> > > >On October 10, 2017 2:22:42 PM PDT, Nick Terrell <terrelln@xxxxxx> wrote:
> > > >>This patch set adds support for a ZSTD-compressed kernel and ramdisk
> > > >>images in the kernel boot process. It only integrates the support with
> > > >>x86, though the first patch is generic to all architectures.
>
> > > Comparing the command line tools on a kernel image that is 68970616 B large:
> > >
> > > | Algorithm | Compression Ratio | Decompression MB/s |
> > > |-----------|-------------------|--------------------|
> > > | zstd | 4.42 | 436.5 |
> > > | gzip | 3.72 | 134.1 |
> > > | xz | 4.83 | 53.1 |
> > > | lz4 | 3.18 | 1682.2 |
> > > | lzo | 3.36 | 389.6 |
> > > | bzip2 | 4.03 | 33.3 |
>
> > Perhaps it'd be a good idea to cull some of bad algorithms? I don't know
> > the memory used by those, but envelope of the table you just shown suggests
> > using bzip2 and lzo is pointless. So is gzip, but it's widespread as the
> > default for initramfs producers, thus it'd be unsafe to kill it.
>
> I'm not sure there is a great use case for bzip2. It requires more memory
> than xz, compresses worse, and decompresses slower. lzo in the kernel might
> decompress a bit faster than zstd (looking back at the BtrFS benchmarks, it
> did). More importantly, it uses less memory than zstd. When decompressing
> the kernel zstd only needs 192 KB, but for initramfs, it will need more.
> Still, unless you really need 5% more compression, lz4 is probably a better
> option than lzo for speed.

The main reason I'm raising this question is, bzip2 has no other users in
the kernel, thus removing support for it would allow us to delete its code.

As for lzo, even if there are cases when it's a bit faster (overcoming the
9% lead zstd has in your userspace benchmark), it wouldn't be faster by
much -- certainly nowhere to make anyone want the massive compression loss.

Thus, let's enumerate use cases:
* a fast modern machine: zstd boots a second faster at the cost of 1.3MB
image size; doesn't matter much either way. No one wants weaker options.
* a weaker machine: zstd rapidly gains as boot times rise.
* a very slow machine with fast I/O that boots often: lz4.
* disk space at extreme premium: xz.

In no case I can think of other algorithms are the rational choice.
There's little gain in removing the rest, though: the code is used for other
purposes in the kernel, and doesn't get compiled in unless manually
selected.

Zstd looks like a good default, but it's nowhere near mature enough: this
patch is x86-only, tools which people may use to analyze compressed kernels
don't know about zstd yet, etc.

Thus, I'd propose the following plan:
* add zstd
* remove bzip2
* update recommendations (Kconfig text)
* in a year or two, consider making zstd the default

> > > I know that this isn't a real benchmark of the kernel decompression. I
> > > still need to figure out how to time the kernel decompression. If you have
> > > any suggestions let me know. Otherwise, I'll get back to you when I've
> > > figured out how to run the benchmark.
>
> I've found a way to benchmark the kernel decompression time during boot
> with QEMU. I add timestamps to every line of the output. I also had to
> print 100 lines before the decompression starts to get consistent results.
>
> I've found that zstd is decompressing 2x slower than it should. I narrowed
> down the problem to ZSTD_wildcopy() and ZSTD_copy8() in
> lib/zstd/zstd_internal.h. ZSTD_wildcopy() calls memcpy(src, dst, 8) in
> a loop and doesn't handle the freestanding memcpy() well. Replacing it with
> __builtin_mcmpy(src, dst, 8) doubles the speed.

... and by maturing I meant issues like this.

> I'm not an expert in freestanding gcc compilation, but I believe it is okay
> to call __builtin_memcpy() in freestanding mode, and gcc will either
> inline it, or add the right function call. The difference being that gcc
> will be able to apply its memcpy() analysis. I also see that
> arch/x86/boot/string.h defines memcpy() to __builtin_memcpy. Is it safe to
> directly use __builtin_memcpy() in lib/zstd/zstd_internal.h?

Try it, and see if it builds and fails to crash. :)


Meow!
--
âââââââ We domesticated dogs 36000 years ago; together we chased
âââââââ animals, hung out and licked or scratched our private parts.
âââââââ Cats domesticated us 9500 years ago, and immediately we got
âââââââ agriculture, towns then cities. -- whitroth on /.