Re: [PATCH v5 0/3]: lib/lzo: run-length encoding support

From: Tao Liu
Date: Thu Mar 07 2024 - 22:26:08 EST


Hi Dave,

On Tue, Feb 05, 2019 at 03:59:59PM +0000, Dave Rodgman wrote:
> Hi,
>
> Following on from the previous lzo-rle patchset:
>
> https://lkml.org/lkml/2018/11/30/972
>
> This patchset contains only the RLE patches, and should be applied on top of
> the non-RLE patches ( https://lkml.org/lkml/2019/2/5/366 ).
>

Sorry for the interruption on such an old patchset and discussion.
I have a few questions about lzo-rle support, and I hope you can give
me some direction. Thanks in advance!

1) Is lzo-rle suitable for a userspace library? I've checked the
current userspace lzo library (lzo-2.10), and it seems to have no
lzo-rle support (please correct me if I'm wrong). If lzo-rle performs
better in the kernel, would it be possible to implement it in
userspace and gain the same benefit there?

2) Yulong TANG has run into a problem where the crash utility cannot
decompress lzo-rle-compressed zram pages since kernel 5.1 [1]. Because
the current lzo library has no lzo-rle support, crash has to import
the kernel source code directly, which is bad for the maintenance of
the crash utility's code. It would be better if the lzo library could
be updated with lzo-rle support. I suspect that not only crash but
also other userspace kernel-debugging tools, such as drgn, may need
this feature.

Do you have any suggestions for these?

[1]: https://www.mail-archive.com/devel@xxxxxxxxxxxxxxxxxxxxxxxxxxx/msg00475.html

Thanks,
Tao Liu


>
> Previously, some questions were raised around the RLE patches. I've done some
> additional benchmarking to answer these questions. In short:
>
> - RLE offers significant additional performance (data-dependent)
> - I didn't measure any regressions that were clearly outside the noise
>
>
> One concern with this patchset was around performance - specifically, measuring
> RLE impact separately from Matt Sealey's patches (CTZ & fast copy). I have done
> some additional benchmarking which I hope clarifies the benefits of each part
> of the patchset.
>
> Firstly, I've captured some memory via /dev/fmem from a Chromebook with many
> tabs open which is starting to swap, and then split this into 4178 4k pages.
> I've excluded the all-zero pages (as zram does), and also the no-zero pages
> (which won't tell us anything about RLE performance). This should give a
> realistic test dataset for zram. What I found was that the data is VERY
> bimodal: 44% of pages in this dataset contain 5% or fewer zeros, and 44%
> contain over 90% zeros (30% if you include the no-zero pages). This supports
> the idea of special-casing zeros in zram.
>
> Next, I've benchmarked four variants of lzo on these pages (on 64-bit Arm at
> max frequency): baseline LZO; baseline + Matt Sealey's patches (aka MS);
> baseline + RLE only; baseline + MS + RLE. Numbers are for weighted roundtrip
> throughput (the weighting reflects that zram does more compression than
> decompression).
>
> https://drive.google.com/file/d/1VLtLjRVxgUNuWFOxaGPwJYhl_hMQXpHe/view?usp=sharing
>
> Matt's patches help in all cases for Arm (and no effect on Intel), as expected.
>
> RLE also behaves as expected: with few zeros present, it makes no difference;
> above ~75%, it gives a good improvement (50 - 300 MB/s on top of the benefit
> from Matt's patches).
>
> Best performance is seen with both MS and RLE patches.
>
> Finally, I have benchmarked the same dataset on an x86-64 device. Here, the
> MS patches make no difference (as expected); RLE helps, similarly as on Arm.
> There were no definite regressions; allowing for observational error, 0.1%
> (3/4178) of cases had a regression > 1 standard deviation, of which the largest
> was 4.6% (1.2 standard deviations). I think this is probably within the noise.
>
> https://drive.google.com/file/d/1xCUVwmiGD0heEMx5gcVEmLBI4eLaageV/view?usp=sharing
>
> One point to note is that the graphs show RLE appears to help very slightly
> with no zeros present! This is because the extra code causes the clang
> optimiser to change code layout in a way that happens to have a significant
> benefit. Taking baseline LZO and adding a do-nothing line like
> "__builtin_prefetch(out_len);" immediately before the "goto next" has the same
> effect. So this is a real, but basically spurious effect - it's small enough
> not to upset the overall findings.
>
> Dave
>
>