Re: [PATCH 2/2] xfs: make sure link path does not go away at access

From: Ian Kent
Date: Mon Nov 15 2021 - 20:12:36 EST


On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:
> On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:
> > On Mon, 15 Nov 2021 at 00:18, Dave Chinner <david@xxxxxxxxxxxxx>
> > wrote:
> > > I just can't see how this race condition is XFS specific and why
> > > fixing it requires XFS to sepcifically handle it while we ignore
> > > similar theoretical issues in other filesystems...
> >
> > It is XFS specific, because all other filesystems RCU free the in-
> > core
> > inode after eviction.
> >
> > XFS is the only one that reuses the in-core inode object and that
> > is
> > very much different from anything the other filesystems do and what
> > the VFS expects.
>
> Sure, but I was refering to the xfs_ifree issue that the patch
> addressed, not the re-use issue that the *first patch addressed*.
>
> > I don't see how clearing the quick link buffer in
> > ext4_evict_inode()
> > could do anything bad.  The contents are irrelevant, the lookup
> > will
> > be restarted anyway, the important thing is that the buffer is not
> > freed and that it's null terminated, and both hold for the ext4,
> > AFAICS.
>
> You miss the point (which, admittedly, probably wasn't clear).
>
> I suggested just zeroing the buffer in xfs_ifree instead of zeroing
> it, which you seemed to suggest wouldn't work and we should move the
> XFS functionality to .free_inode. That's what I was refering to as
> "not being XFS specific" - if it is safe for ext4 to zero the link
> buffer in .evict while lockless lookups can still be accessing the
> link buffer, it is safe for XFS to do the same thing in .destroy
> context.

I'll need to think about that for a while.

Zeroing the buffer while it's being used seems like a problem to
me and was what this patch was trying to avoid.

I thought all that would be needed for this to happen is for a
dentry drop to occur while the link walk was happening after
->get_link() had returned the pointer.

What have I got wrong in that thinking?

>
> If it isn't safe for ext4 to do that, then we have a general
> pathwalk problem, not an XFS issue. But, as you say, it is safe to
> do this zeroing, so the fix to xfs_ifree() is to zero the link
> buffer instead of freeing it, just like ext4 does.
>
> As a side issue, we really don't want to move what XFS does in
> .destroy_inode to .free_inode because that then means we need to add
> synchronise_rcu() calls everywhere in XFS that might need to wait on
> inodes being inactivated and/or reclaimed. And because inode reclaim
> uses lockless rcu lookups, there's substantial danger of adding rcu
> callback related deadlocks to XFS here. That's just not a direction
> we should be moving in.

Another reason I decided to use the ECHILD return instead is that
I thought synchronise_rcu() might add an unexpected delay.

Since synchronise_rcu() will only wait for processes that currently
have the rcu read lock do you think that could actually be a problem
in this code path?

>
> I'll also point out that this would require XFS inodes to pass
> through *two* rcu grace periods before the memory they hold could be
> freed because, as I mentioned, xfs inode reclaim uses rcu protected
> inode lookups and so relies on inodes to be freed by rcu callback...
>
> > I tend to agree with Brian and Ian at this point: return -ECHILD
> > from
> > xfs_vn_get_link_inline() until xfs's inode resue vs. rcu walk
> > implications are fully dealt with.  No way to fix this from VFS
> > alone.
>
> I disagree from a fundamental process POV - this is just sweeping
> the issue under the table and leaving it for someone else to solve
> because the root cause of the inode re-use issue has not been
> identified. But to the person who architected the lockless XFS inode
> cache 15 years ago, it's pretty obvious, so let's just solve it now.

Sorry, I don't understand what you mean by the root cause not
being identified?

Until lockless path walking was introduced this wasn't a problem
because references were held during walks so there could never be
a final dput() to trigger freeing process during a walk. And a lot
of code probably still makes that assumption. And code that does
make that assumption should return -ECHILD in cases like this so
that the VFS can either legitimize the struct path (by taking
references) or restart the walk in ref-walk mode.

Can you elaborate please?

>
> With the xfs_ifree() problem solved by zeroing rather than freeing,
> then the only other problem is inode reuse *within an rcu grace
> period*. Immediate inode reuse tends to be rare, (we can actually
> trace occurrences to validate this assertion), and implementation
> wise reuse is isolated to a single function: xfs_iget_recycle().
>
> xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
> that found the inode marks it as being reclaimed (preventing other
> lookups from finding it), then re-initialises the inode. This is
> what makes .get_link change in the middle of pathwalk - we're
> reinitialising the inode without waiting for the RCU grace period to
> expire.

Ok, good to know that, there's a lot of icache code to look
through, ;)

At this point I come back to thinking the original patch might
be sufficient. But then that's only for xfs and excludes
potential problems with other file systems so I'll not go
there.

>
> The obvious thing to do here is that after we drop the RCU read
> context, we simply call synchronize_rcu() before we start
> re-initialising the inode to wait for the current grace period to
> expire. This ensures that any pathwalk that may have found that
> inode has seen the sequence number change and droppped out of
> lockless mode and is no longer trying to access that inode.  Then we
> can safely reinitialise the inode as it has passed through a RCU
> grace period just like it would have if it was freed and
> reallocated.

Sounds right to me, as long as it is ok to call synchronize_rcu()
here.

>
> This completely removes the entire class of "reused inodes race with
> VFS level RCU walks" bugs from the XFS inode cache implementation,
> hence XFS inodes behave the same as all other filesystems w.r.t RCU
> grace period expiry needing to occur before a VFS inode is reused.
>
> So, it looks like three patches to fix this entirely:
>
> 1. the pathwalk link sequence check fix
> 2. zeroing the inline link buffer in xfs_ifree()

I'm sorry but I'm really having trouble understanding how this is
ok. If some process is using the buffer to walk a link path how
can zeroing the contents of the buffer be ok?

> 3. adding synchronize_rcu() (or some variant) to xfs_iget_recycle()
>
> Cheers,
>
> Dave.