Re: [PATCH 0/2] minimize swapping on zswap store failure

From: Nhat Pham
Date: Tue Oct 17 2023 - 15:03:29 EST

Next message: Alexei Starovoitov: "Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH"
Previous message: Namhyung Kim: "Re: [PATCH] perf build: Remove stray '\' before that is warned about since grep 3.8"
In reply to: Nhat Pham: "Re: [PATCH 0/2] minimize swapping on zswap store failure"
Next in thread: Nhat Pham: "Re: [PATCH 0/2] minimize swapping on zswap store failure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Oct 16, 2023 at 5:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > Currently, when a zswap store attempt fails, the page is immediately
> > swapped out. This could happen for a variety of reasons. For instance,
> > the compression algorithm could fail (such as when the data is not
> > compressible), or the backend allocator might not be able to find a
> > suitable slot for the compressed page. If these pages are needed
> > later on, users will incur IOs from swapins.
> >
> > This issue prevents the adoption of zswap for potential users who
> > cannot tolerate the latency associated with swapping. In many cases,
> > these IOs are avoidable if we just keep in memory the pages that zswap
> > fail to store.
> >
> > This patch series add two new features for zswap that will alleviate
> > the risk of swapping:
> >
> > a) When a store attempt fail, keep the page untouched in memory
> > instead of swapping it out.
>
> What about writeback when the zswap limit is hit? I understand the
> problem, but I am wondering if this is the correct way of fixing it.
> We really need to make zswap work without a backing swapfile, which I
> think is the correct way to fix all these problems. I was working on
> that, but unfortunately I had to pivot to something else before I had
> something that was working.
>
> At Google, we have "ghost" swapfiles that we use just to use zswap
> without a swapfile. They are sparse files, and we have internal kernel
> patches to flag them and never try to actually write to them.
>
> I am not sure how many bandaids we can afford before doing the right
> thing. I understand it's a much larger surgery, perhaps there is a way
> to get a short-term fix that is also a step towards the final state we
> want to reach instead?

Regarding the writeback - I'll make sure to also short-circuit writeback
when the bypass_swap option is enabled in v2 :) I'll probably send out
v2 after

I absolutely agree that we must decouple zswap and swap (and would
be happy to help out in any capacity I could - we have heard similar
concerns/complaints about swap wastage from internal parties as well).

However, as Johannes has pointed out, this feature still has its place,
given our already existing swapfile deployments. I do agree that a
global knob is insufficient tho. I'll add a per-cgroup knob in v2 so that
we can enable/disable this feature on a per-workload basis.

>
> >
> > b) If the store attempt fails at the compression step, allow the page
> > to be stored in its uncompressed form in the zswap pool. This maintains
> > the LRU ordering of pages, which will be helpful for accurate
> > memory reclaim (zswap writeback in particular).
>
> This is dangerous. Johannes and I discussed this before. This means
> that reclaim can end up allocating more memory instead of freeing.
> Allocations made in the reclaim path are made under the assumption
> that we will eventually free memory. In this case, we won't. In the
> worst case scenario, reclaim can leave the system/memcg in a worse
> state than before it started.
>
> Perhaps there is a way we can do this without allocating a zswap entry?
>
> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

Hmm yeah you're right about these concerns. That seems like a lot more
involved than what I envisioned initially.

Let's put this aside for now. I'll just send the first patch in v2, and we can
work on + discuss more about uncompressed pages storing later on.

>
> >
> > These features could be enabled independently via two new zswap module
> > parameters.
> >
> > Nhat Pham (2):
> > swap: allows swap bypassing on zswap store failure
> > zswap: store uncompressed pages when compression algorithm fails
> >
> > Documentation/admin-guide/mm/zswap.rst | 16 +++++++
> > include/linux/zswap.h | 9 ++++
> > mm/page_io.c | 6 +++
> > mm/shmem.c | 8 +++-
> > mm/zswap.c | 64 +++++++++++++++++++++++---
> > 5 files changed, 95 insertions(+), 8 deletions(-)
> >
> > --
> > 2.34.1

Next message: Alexei Starovoitov: "Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH"
Previous message: Namhyung Kim: "Re: [PATCH] perf build: Remove stray '\' before that is warned about since grep 3.8"
In reply to: Nhat Pham: "Re: [PATCH 0/2] minimize swapping on zswap store failure"
Next in thread: Nhat Pham: "Re: [PATCH 0/2] minimize swapping on zswap store failure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]