Re: [PATCH] mm: zswap: shrink until can accept

From: Domenico Cerasuolo
Date: Thu Jun 01 2023 - 11:28:47 EST


On Wed, May 31, 2023 at 3:06 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote:
> > > Maybe ENOMEM is a bad example. How about if the swap device
> > > just went bad and can't complete new IO writes?
> >
> > This is actually outside the scope of zswap, and handled by the
> > swapcache (end_swap_bio_write).
> >
> > Once the IO is submitted, zswap will ax its copy and leave the rest to
> > the swapcache. When the swapout fails on an IO error, it behaves the
> > same way as if zswap had never been involved to begin with.
> >
> > From a zswap perspective, there are no persistent errors in moving a
> > zswap entry back into the swapcache. Not just currently, but generally.
> Again, you are right that this zswap writeback is async.
> So the writeback error is NOT going to propagate to the
> shrink function.
>
> With the current three pool backends that I looked at (zbud,
> z3fold, zsmalloc), they all retry 8 times internally.
> Adding more retries does not fundamentally change the existing
> behavior.
>
> I looked at all the possible error codes generated inside
> the reclaim code; the only noticeable errors are ENOMEM
> and those from a concurrent swap invalidation or a racing swapin fault.
>
> BTW, zswap reclaim consumes memory. Continuing to loop on ENOMEM
> might cause more OOM, but that can happen in the current code
> as well.
>
> > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > > > invalidation or a racing swapin fault. In both cases we should
> > > > absolutely keep trying other entries until the goal is met.
> > >
> > > How about a narrower fix recognizing those error cases and making
> > > the inner loop continue on those errors?
> >
> > Right, I just don't really see the value proposition of this
> > complication, and I see some downsides (see below). No single-entry
> > error should ever cause us to stop the wider reclaim loop.
>
> That is, until the current LRU list has been gone through once.
> I expect repeating the same list to yield fewer reclaimed pages.
>
> >
> > > > > > extreme case where it's the only page left on the list, I again don't
> > > > > > see how retrying a few times will make the situation worse.
> > > > > >
> > > > > > In practice, IMO there is little upside in trying to be more
> > > > > > discerning about the error codes. Simple seems better here.
> > > > >
> > > > > Just trying to think about what should be the precise loop termination
> > > > > condition here.
> > > > >
> > > > > I still feel blindly trying a few times is a very imprecise condition.
> > > >
> > > > The precise termination condition is when can_accept() returns true
> > > > again. The safety cap is only added as a precaution to avoid infinite
> > > > loops if something unexpected goes wrong, now or in the future.
> > >
> > > In my mind, that statement already suggests can_accept() is not
> > > *precise*, considering the need to avoid an infinite loop.
> > > E.g. do we know what the optimal cap value is, and why that value
> > > is optimal?
> >
> > Oh but it is precise. That's the goal we want to accomplish.
>
> I understand it is the goal; the precise condition I am
> talking about is the loop termination condition, and can_accept()
> is not the only one. Anyway, let's move on.
> >
> > The cap is just so that in case something unexpectedly goes wrong (a
> > bug), we fail gracefully and don't lock up the machine. It's the same
> > reason we prefer WARN_ONs over BUG_ONs when we can, to avoid
> > crashes. That's really all there is to it, and it strikes me as a
> > reasonable and robust design choice. It's fine to limp along or be
> > suboptimal after such a bug happens; the bar is avoiding an infinite
> > loop, nothing else.
> >
> > Your suggestion of whitelisting certain errors is more complicated,
> > but also less robust: in case an entry error does by some accident
> > become persistent for the whole LRU, we're locking up the host. We'd
> > rather catch a bug like this by seeing spikes in the reclaim failure
> > rate than by losing production machines.
> >
> > > Putting the definition of precise aside, I do see the unconditional
> > > retry can have unwanted effects.
> >
> > I hope I could address this above. But if not, please share your
> > concerns.
> Thanks for the discussion. I am less concerned about the retry now.
> Retry on EAGAIN might be the simplest way to proceed.
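
Just to make this concrete, the loop being discussed is roughly of the
following shape. This is a simplified sketch rather than the exact code:
the per-entry reclaim helper and the name of the retry cap are
illustrative, the points that matter are the can_accept() termination
condition and the -EAGAIN handling.

	static void shrink_worker(struct work_struct *w)
	{
		struct zswap_pool *pool = container_of(w, typeof(*pool),
						       shrink_work);
		int ret, failures = 0;

		do {
			/* write back one entry from the tail of the LRU */
			ret = zswap_reclaim_entry(pool);
			if (ret) {
				zswap_reject_reclaim_fail++;
				/* give up on anything but transient races */
				if (ret != -EAGAIN)
					break;
				/* safety cap against an unexpected livelock */
				if (++failures == MAX_RECLAIM_RETRIES)
					break;
			}
			cond_resched();
		} while (!zswap_can_accept());
		zswap_pool_put(pool);
	}

The loop terminates as soon as zswap_can_accept() returns true; the
failure counter is only there so that a bug cannot turn this into an
infinite loop.
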
>
> Outside the scope of this patch, I am still surprised to see
> such a high number of retries caused by race conditions. There are
> already 8 inner-loop retries, so the actual number of pages retried
> needs to be multiplied by 8.
>
> If there is a reproducer script, I want to reproduce this locally
> to understand it better. I wish there were ways to reduce the retries.

With `stress` you can reproduce the errors quite easily. Since you're not
going to see many writebacks (if any at all) _without_ this patch, you can
let the zpool fill while keeping the default max_pool_percent value of 20,
and then, once it is full, switch max_pool_percent to 1.

>
> Another idea is that we could start the shrinking once the
> pool max is reached, and try to reduce the pool to the threshold earlier.

The shrinking already starts when pool max is reached, with and without the
proposed patch. If you're referring to a new threshold that would kick off
the shrinking, that could be a nice improvement to the overall mechanism,
and it would fit nicely on top of this patch.
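
For example (purely illustrative, nothing like this is in the patch; the
knob and helper names below are invented), such a threshold could sit
next to the existing zswap_is_full() check in the store path, so the
worker starts draining before the hard limit is hit:

	/* hypothetical lower watermark, below max_pool_percent */
	static unsigned int zswap_shrink_start_percent = 15;

	static bool zswap_near_full(void)
	{
		unsigned long pool_pages = DIV_ROUND_UP(zswap_pool_total_size,
							PAGE_SIZE);

		return pool_pages > totalram_pages() *
				    zswap_shrink_start_percent / 100;
	}

	/* in the store path, alongside the zswap_is_full() check */
	if (zswap_near_full())
		queue_work(shrink_wq, &pool->shrink_work);

The worker loop itself would stay the same; only the condition that
queues the work would change.
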

>
> Chris