Re: [PATCH 2/2] xfs: make sure link path does not go away at access

From: Ian Kent
Date: Sat Nov 13 2021 - 00:18:25 EST


On Fri, 2021-11-12 at 06:47 -0500, Brian Foster wrote:
> On Fri, Nov 12, 2021 at 07:10:19AM +0800, Ian Kent wrote:
> > On Thu, 2021-11-11 at 11:08 -0500, Brian Foster wrote:
> >
> > Hi Brian,
> >
> > > On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:
> > > > When following a trailing symlink in rcu-walk mode it's
> > > > possible to
> > > > succeed in getting the ->get_link() method pointer but the link
> > > > path
> > > > string be deallocated while it's being used.
> > > >
> > > > Utilize the rcu mechanism to mitigate this risk.
> > > >
> > > > Suggested-by: Miklos Szeredi <miklos@xxxxxxxxxx>
> > > > Signed-off-by: Ian Kent <raven@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/kmem.h      |    4 ++++
> > > >  fs/xfs/xfs_inode.c |    4 ++--
> > > >  fs/xfs/xfs_iops.c  |   10 ++++++++--
> > > >  3 files changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > ...
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index a607d6aca5c4..2977e19da7b7 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -524,11 +524,17 @@ xfs_vn_get_link_inline(
> > > >  
> > > >         /*
> > > >          * The VFS crashes on a NULL pointer, so return -
> > > > EFSCORRUPTED if
> > > > -        * if_data is junk.
> > > > +        * if_data is junk. Also, if the path walk is in rcu-
> > > > walk
> > > > mode
> > > > +        * and the inode link path has gone away due inode re-
> > > > use
> > > > we have
> > > > +        * no choice but to tell the VFS to redo the lookup.
> > > >          */
> > > > -       link = ip->i_df.if_u1.if_data;
> > > > +       link = rcu_dereference(ip->i_df.if_u1.if_data);
> > > > +       if (!dentry && !link)
> > > > +               return ERR_PTR(-ECHILD);
> > > > +
> > >
> > > One thing that concerns me slightly about this approach is that
> > > inode
> > > reuse does not necessarily guarantee that if_data is NULL. It
> > > seems
> > > technically just as possible (even if exceedingly unlikely) for
> > > link
> > > to
> > > point at newly allocated memory since the previous sequence count
> > > validation check. The inode could be reused as another inline
> > > symlink
> > > for example, though it's not clear to me if that is really a
> > > problem
> > > for
> > > the vfs (assuming a restart would just land on the new link
> > > anyways?).
> > > But the inode could also be reallocated as something like a
> > > shortform
> > > directory, which means passing directory header data or whatever
> > > that
> > > it
> > > stores in if_data back to pick_link(), which is then further
> > > processed
> > > as a string.
> >
> > This is the sort of feedback I was hoping for.
> >
> > This sounds related to the life-cycle of xfs inodes and re-use.
> > Hopefully someone here on the list can enlighten me on this.
> >
> > The thing that comes to mind is that the inode re-use would
> > need to occur between the VFS check that validates the inode
> > is still ok and the use of link string. I think that can still
> > go away even with the above check.
> >
>
> Yeah... The original NULL ->get_link() problem was replicated with a
> small delay in the lookup path (specifically in the symlink
> processing
> path). This essentially widens the race window and allows a separate
> task to invalidate the dentry between the time the last dentry
> sequence
> validation occurred (and passed) and the attempt to call ->get_link()
> becomes imminent. I think patch 1 largely addresses this issue
> because
> we'll have revalidated the previous read of the function pointer
> before
> we attempt to call it.
>
> That leads to this patch, which suggests that even after the
> validation
> fix a small race window still technically exists with the -
> >get_link()
> code and inode teardown. In fact, it's not that hard to show that
> this
> is true by modifying the original reproducer to push the delay out
> beyond the check added by patch 1 (or into the ->get_link()
> callback).
> Playing around with that a bit, it's possible to induce a -
> >get_link()
> call to an inode that was reallocated as a shortform directory and
> returns a non-NULL if_data fork of that dir back to the vfs (to be
> interpreted as a symlink string). Nothing seems to explode on my
> quick
> test, fortunately, but I don't think that's an interaction we want to
> maintain.
>
> Of course one caveat to all of that is that after patch 1, the race
> window for that one might be so small as to make this impossible to
> reproduce in practice (whereas the problem fixed by patch 1 has been
> reproduced by users)...
>
> > Hopefully someone can clarify what happens here.
> >
> > >
> > > With that, I wonder why we wouldn't just return -ECHILD here like
> > > we
> > > do
> > > for the non-inline case to address the immediate problem, and
> > > then
> > > perhaps separately consider if we can rework bits of the
> > > reuse/reclaim
> > > code to allow rcu lookup of inline symlinks under certain
> > > conditions.
> >
> > Always switching to ref-walk mode would certainly resolve the
> > problem too, yes, perhaps we have no choice ...
> >
>
> Oh I don't think it's the only choice. I think Miklos' suggestion to
> use
> ->free_inode() is probably the right general approach. I just think a
> switch to ref-walk mode might be a good incremental step to fix this
> problem in a backportable way (s_op->free_inode() is newer relative
> to
> the introduction of _get_link_inline()). We can always re-enable rcu
> symlink processing once we get our inode teardown/reuse bits fixed up
> accordingly.. Just my .02.

Yes, I've had a change of heart on this too.

I think returning -ECHILD from xfs_vn_get_link_inline() is the
best solution.

There are a couple of reasons for that, the main one being the
link string can still go away while the VFS is using it, but
also Al has said more than once that switching to ref-walk mode
is not a big deal and that makes the problems vanish completely.
In any case references are taken at successful walk completion
anyway.

If it's found staying rcu-walk mode whenever possible is worth
while in cases like this then there's probably a lot more to do
to do this properly. The lockless stuff is tricky and error prone
(certainly it is for me) and side effects are almost always hiding
in unexpected places.

So as you say, that's something for another day.
I'll update the patch and post an update.

Ian
>
> Brian
>
> > Ian
> > >
> > > FWIW, I'm also a little curious why we don't set i_link for
> > > inline
> > > symlinks. I don't think that addresses this validation problem,
> > > but
> > > perhaps might allow rcu lookups in the inline symlink common case
> > > where
> > > things don't change during the lookup (and maybe even eliminate
> > > the
> > > need
> > > for this custom inline callback)..?
> > >
> > > Brian
> > >
> > > >         if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > >                 return ERR_PTR(-EFSCORRUPTED);
> > > > +
> > > >         return link;
> > > >  }
> > > >  
> > > >
> > > >
> > >
> >
> >
>