Re: [RFC PATCH 6/9] mm: zswap: drop support for non-zero same-filled pages handling

From: Johannes Weiner
Date: Fri Mar 29 2024 - 17:18:06 EST


On Fri, Mar 29, 2024 at 11:56:46AM -0700, Yosry Ahmed wrote:
> > I perf'd zswapping out data that is 10% same-filled and 90% data that
> > always needs compression. It does nothing but madvise(MADV_PAGEOUT),
> > and the zswap_store() stack is already only ~60% of the cycles.
> >
> > Using zsmalloc + zstd, this is the diff between vanilla and my patch:
> >
> > # Baseline  Delta Abs  Shared Object      Symbol
> > # ........  .........  .................  .................................
> > #
> >      4.34%     -3.02%  [kernel.kallsyms]  [k] zswap_store
> >     11.07%     +1.41%  [kernel.kallsyms]  [k] ZSTD_compressBlock_doubleFast
> >     15.55%     +0.91%  [kernel.kallsyms]  [k] FSE_buildCTable_wksp
> >
> > As expected, we have to compress a bit more; on the other hand we're
> > removing the content scan for same-filled for 90% of the pages that
> > don't benefit from it. They almost amortize each other. Let's round it
> > up and the remaining difference is ~1%.
>
> Thanks for the data, this is very useful.
>
> There is also the load path. The original patch that introduced
> same-filled pages reports more gains on the load side, which also
> happens to be more latency-sensitive. Of course, the data could be
> outdated -- but it's a different type of workload than what will be
> running in a data center fleet AFAICT.
>
> Is there also no noticeable difference on the load side in your data?

     9.40%     +2.51%  [kernel.kallsyms]  [k] ZSTD_safecopy
     0.76%     -0.48%  [kernel.kallsyms]  [k] zswap_load

About 2% net.

Checking other compression algorithms, lzo eats 27.58%, and lz4
13.82%. The variance between these alone makes our "try to be smarter
than an actual compression algorithm" code look even sillier.
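
For reference, a minimal user-space sketch of the kind of MADV_PAGEOUT
microbenchmark described at the top of this mail could look like the
following. The buffer size, the 10% same-filled ratio and the 0x5a fill
pattern are assumptions for illustration, not the exact test that
produced the numbers above; it only exercises zswap if zswap is enabled
and swap is configured:

/*
 * Sketch of an MADV_PAGEOUT microbenchmark along the lines described
 * above. Sizes, ratio and fill pattern are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* from linux/mman.h, kernel >= 5.4 */
#endif

int main(void)
{
	size_t page = sysconf(_SC_PAGESIZE);
	size_t npages = 256 * 1024;	/* 1G with 4k pages */
	size_t len = npages * page;
	unsigned long sum = 0;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* 10% same-filled pages, 90% pages that need real compression */
	for (size_t i = 0; i < npages; i++) {
		if (i % 10 == 0)
			memset(buf + i * page, 0x5a, page);
		else
			for (size_t j = 0; j < page; j++)
				buf[i * page + j] = rand();
	}

	/* reclaim everything; with zswap enabled this hits zswap_store() */
	madvise(buf, len, MADV_PAGEOUT);

	/* fault it all back in; this hits zswap_load() */
	for (size_t i = 0; i < npages; i++)
		sum += buf[i * page];

	printf("%lu\n", sum);
	munmap(buf, len);
	return 0;
}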

> Also, how much increase did you observe in the size of compressed data
> with your patch? Just curious about how zstd ended up handling those
> pages.

Checking zsmalloc stats, it did pack the same-filled ones down into 32
byte objects, i.e. ~0.8% (32/4096) of their original size. That's
negligible, even for workloads that have an unusually high share of
them.

> > It's difficult to make the case that this matters to any real
> > workloads with actual think time in between paging.
>
> If the difference in performance is 1%, I surely agree.
>
> The patch reported 19-32% improvement in store time for same-filled
> pages depending on the workload and platform. For 10% same-filled
> pages, this should roughly correspond to 2-3% overall improvement,
> which is not too far from the 1% you observed.

Right.

> The patch reported much larger improvement for load times (which
> matters more), 49-85% for same-filled pages. If this corresponds to
> 5-8% overall improvement in zswap load time for a workload with 10%
> same-filled pages, that would be very significant. I don't expect to
> see such improvements tbh, but we should check.

No, I'm seeing around 2% net.

> > But let's say you do make the case that zero-filled pages are worth
> > optimizing for.
>
> I am not saying they are for sure, but removing the same-filled
> checking didn't seem to improve performance much in my testing, so the
> cost seems to be mostly in maintenance. This makes it seem to me that
> it *could* be a good tradeoff to only handle zero-filled pages. We can
> make them slightly faster and we can trim the complexity -- as shown
> by this patch.

In the original numbers, there was a certain percentage of non-zero
same-filled pages. You still first have to find that number for real
production loads to determine what the tradeoff actually is.

And I disagree that the code is less complex. The memchr_inv() check
is fewer lines than the open-coded word-scan, but that scan is dead
simple, self-contained and out of the way.
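
To make that concrete, the two shapes under discussion look roughly
like this. This is a user-space sketch, not the actual mm/zswap.c
code; PAGE_SIZE and the helper names here are just for illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Open-coded word scan: detects any same-filled pattern and returns it. */
static bool page_same_filled(const void *ptr, unsigned long *value)
{
	const unsigned long *page = ptr;
	size_t last = PAGE_SIZE / sizeof(*page) - 1;
	unsigned long val = page[0];

	if (val != page[last])
		return false;
	for (size_t pos = 1; pos < last; pos++)
		if (val != page[pos])
			return false;
	*value = val;
	return true;
}

/* Zero-only check, in the spirit of memchr_inv(ptr, 0, PAGE_SIZE) == NULL. */
static bool page_zero_filled(const void *ptr)
{
	const unsigned long *page = ptr;

	for (size_t pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++)
		if (page[pos])
			return false;
	return true;
}

int main(void)
{
	unsigned long buf[PAGE_SIZE / sizeof(unsigned long)];
	unsigned long val;

	memset(buf, 0, sizeof(buf));
	printf("zero page: same=%d zero=%d\n",
	       page_same_filled(buf, &val), page_zero_filled(buf));

	memset(buf, 0x5a, sizeof(buf));
	printf("0x5a page: same=%d zero=%d\n",
	       page_same_filled(buf, &val), page_zero_filled(buf));
	return 0;
}

Both variants are trivial; the second one just stops recognizing
non-zero patterns.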

So call that a wash in terms of maintenance burden, but for a
reduction in functionality and a regression risk (however small).

The next patch to store them as special xarray entries is also a wash
at best. It trades the entry having an implicit subtype for no type at
all. zswap_load_zero_filled() looks like it's the fast path, and
decompression the fallback; the entry destructor is now called
"zswap_tree_free_element" and takes a void pointer. It also re-adds
most of the lines deleted by the zero-only patch.
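
Just to illustrate the typing point - this is not code from the
series, only a generic example of what a void * destructor gives up:

#include <stdlib.h>

struct zswap_entry { unsigned long handle; };

/* typed destructor: the compiler checks what is being freed */
static void entry_free(struct zswap_entry *entry)
{
	free(entry);
}

/* untyped destructor: accepts any pointer, mistakes only show at runtime */
static void tree_free_element(void *element)
{
	free(element);
}

int main(void)
{
	struct zswap_entry *e = calloc(1, sizeof(*e));

	entry_free(e);					/* type-checked */
	tree_free_element(calloc(1, sizeof(*e)));	/* anything goes */
	return 0;
}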

I think this is actually a bit worse than the status quo.

But my point is, this all seems like a lot of opportunity cost in
terms of engineering time, review bandwidth, regression risk, and the
readability, hackability, reliability and predictability of the code -
for gains that have still not been shown to actually matter in
practice.

https://lore.kernel.org/all/20240328122352.a001a56aed97b01ac5931998@xxxxxxxxxxxxxxxxxxxx/

This needs numbers to show not just that your patches are fine, but
that any version of this optimization actually matters for real
workloads. Without that, my vote would be NAK.