Re: [PATCH v2] mm/swap: fix race when skipping swapcache

From: Chris Li
Date: Thu Feb 08 2024 - 14:42:40 EST

Next message: Conor Dooley: "Re: [PATCH 1/1] dt-bindings: pci: layerscape-pci: Convert to yaml file"
Previous message: Joe Damato: "Re: [PATCH net-next v3] net/mlx5e: link NAPI instances to queues and IRQs"
In reply to: Kairui Song: "Re: [PATCH v2] mm/swap: fix race when skipping swapcache"
Next in thread: Kairui Song: "Re: [PATCH v2] mm/swap: fix race when skipping swapcache"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Feb 8, 2024 at 11:01 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
> > Kairui Song <ryncsn@xxxxxxxxx> writes:
> >
> > > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > >>
> > >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
> >
> > [snip]
> >
> > >> >
> > >> > So I think the thing is, it's getting complex because this patch
> > >> > wanted to make it simple and just reuse the swap cache flags.
> > >>
> > >> I agree that a simple fix would be the important at this point.
> > >>
> > >> Considering your description, here's my understanding of the other idea:
> > >> Other method, such as increasing the swap count, haven't proven effective
> > >> in your tests. The approach risk forcing racers to rely on the swap cache
> > >> again and the potential performance loss in race scenario.
> > >>
> > >> While I understand that simplicity is important, and performance loss
> > >> in this case may be infrequent, I believe swap_count approach could be a
> > >> suitable solution. What do you think?
> > >
> > > Hi Minchan
> > >
> > > Yes, my main concern was about simplicity and performance.
> > >
> > > Increasing swap_count here will also race with another process from
> > > releasing swap_count to 0 (swapcache was able to sync callers in other
> > > call paths but we skipped swapcache here).
> >
> > What is the consequence of the race condition?
>
> Hi Ying,
>
> It will increase the swap count of an already freed entry, this race
> with multiple swap free/alloc logic that checks if count ==
> SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry,
> all result in random corruption of the swap map. This happens a lot
> during stress testing.

In theory, the original code before your patch can get into a
situation similar to what you try to avoid.
CPU1 enters the do_swap_page() with entry swap count == 1 and
continues handling the swap fault without swap cache. Then some
operation bumps up the swap entry count and CPU2 enters the
do_swap_page() racing with CPU1 with swap count == 2. CPU2 will need
to go through the swap cache case. We still need to handle this
situation correctly.
So the complexity is already there.

If we can make sure the above case works correctly, then one way to
avoid the problem is just send the CPU2 to use the swap cache (without
the swap cache by-passing).

>
> >
> > > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still
> > > have swap_count == 1, bail out if not; 3. Set it to 2;
> > > __swap_duplicate can be modified to support this, it's similar to
> > > existing logics for SWAP_HAS_CACHE.
> > >
> > > And swap freeing path will do more things, swapcache clean up needs to
> > > be handled even in the bypassing path since the racer may add it to
> > > swapcache.
> > >
> > > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many
> > > overhead, so I used that way in this patch, the only issue is
> > > potentially repeated page faults now.
> > >
> > > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
> > > at naming it) special value, so any racer can just spin on it to avoid
> > > all the problems, how do you think about this?
> >
> > Let's try some simpler method firstly.
>
> Another simpler idea is, add a schedule() or
> schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
> path before goto out (just like __read_swap_cache_async). I think this
> should ensure in almost all cases, PTE is ready after it returns, also
> yields more CPU.

I mentioned in my earlier email and Ying points out that as well.
Looping waiting inside do_swap_page() is bad because it is holding
other locks. Sorry I started this idea but it seems no good. If we can
have CPU2 make forward progress without retrying the page fault would
be the best, if possible.

Chris

Next message: Conor Dooley: "Re: [PATCH 1/1] dt-bindings: pci: layerscape-pci: Convert to yaml file"
Previous message: Joe Damato: "Re: [PATCH net-next v3] net/mlx5e: link NAPI instances to queues and IRQs"
In reply to: Kairui Song: "Re: [PATCH v2] mm/swap: fix race when skipping swapcache"
Next in thread: Kairui Song: "Re: [PATCH v2] mm/swap: fix race when skipping swapcache"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]