Re: [PATCH v3] xfs: introduce protection for drop nlink

From: Dave Chinner
Date: Mon Sep 18 2023 - 01:51:40 EST


On Mon, Sep 18, 2023 at 11:44:53AM +0800, cheng.lin130@xxxxxxxxxx wrote:
> > On Fri, Sep 15, 2023 at 05:50:56PM +0800, cheng.lin130@xxxxxxxxxx wrote:
> > > > On Wed, Sep 13, 2023 at 05:44:45PM +0800, cheng.lin130@xxxxxxxxxx wrote:
> > > > > From: Cheng Lin <cheng.lin130@xxxxxxxxxx>
> > > > >
> > > > > When abnormal drop_nlink are detected on the inode,
> > > > > shutdown filesystem, to avoid corruption propagation.
> > > > >
> > > > > Signed-off-by: Cheng Lin <cheng.lin130@xxxxxxxxxx>
> > > > > ---
> > > > > fs/xfs/xfs_inode.c | 9 +++++++++
> > > > > 1 file changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 9e62cc500..40cc106ae 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -919,6 +919,15 @@ xfs_droplink(
> > > > > xfs_trans_t *tp,
> > > > > xfs_inode_t *ip)
> > > > > {
> > > > > +
> > > > > + if (VFS_I(ip)->i_nlink == 0) {
> > > > > + xfs_alert(ip->i_mount,
> > > > > + "%s: Deleting inode %llu with no links.",
> > > > > + __func__, ip->i_ino);
> > > > > + tp->t_flags |= XFS_TRANS_DIRTY;
> > > > Marking the transaction dirty is not necessary.
> > > > Otherwise this seems fine.
> > > Another strategy:
> > > Set nlink to an invalid value(like XFS_NLINK_PINNED), and
> > > Complete this transaction before shutdown fs. To make sure
> > > nlink not be zero. If the nlink of a directory are zero, it may
> > > be cleaned up.
> > > Is that appropriate?
> > No, all I'm asking you to do is drop dirtying of the transaction
> > from this patch because it is a) unnecessary and b) a layering
> > violation.
> > It is unnecessary because the transaction will almost always be
> > dirty before we get to xfs_droplink(). In the cases where it isn't
> > dirty (e.g. xfs_remove() on a directory) we explicitly check that
> > nlink == 2 before proceeding to call xfs_droplink(). Hence we can't
> > actually get to xfs_droplink() with a clean transaction, and so
> If the corrupted inode is a parent directory, when remove its
> subdirectory, the parent's nlink will be decreased to 0. But the
> transaction of subdirectory removing is not dirty (There are not
> check about the parent directory). In this situation, the transaction
> will be failed and the filesystem will be alive.

Yes, and that's perfectly fine. The transaction cancelling code has
worked this way for the past 20 years or so...

Indeed, you said your customer wants the system to stay alive if possible,
right? Well, so do we.

If the parent directory has a bogus nlink count, and that prevents
us from removing items from the directory, then as long as we
haven't dirtied anything and we can return a -EFSCORRUPTED error to
userspace to say the unlink failed and we don't have to shut the
filesystem down. All we now have is a directory that has objects in
it that can't be removed....

For a higher level perspective, we only need to shut the filesystem
down if we cannot safely back out of the modification operation that
was requested. Whilst the transaction is clean, we can safely return
errors to userspace and continue operation because everything in
memory and on disk is still consistent, even if we have found a
corruption in non-crtical the metadata. Just returning an error to
userspace can't make the problem any worse.

This also is how we treat corruption that is found during read
operations - we return -EFSCORRUPTED to userspace because something in
the directory or inode we were trying to read from was corrupted. We
do not need to shut down the filesystem because there is
no risk of making things worse or the in-memory filesystem state
getting out of sync with the on-disk state.

It is only when we are trying to modify something that corruption
becomes a problem with fatal consequences. Once we've made a
modification, the in-memory state is different to the on-disk state
and whilst we are in that state any corruption we discover becomes
fatal. That is because there is no way to reconcile the changes
we've already made in memory with what is on-disk - we don't know
that the in-memory changes are good because we tripped over
corruption, and so we must not propagate bad in-memory state and
metadata to disk over the top of what may be still be uncorrupted
metadata on disk.

This "in memory matches on disk" state is effectively what the dirty
flag in the transaction tracks, and it's done as part of the normal
running of a transaction as items are tagged for logging. Marking a
transaction dirty that has nothign tagged for logging is actually an
incorrect state; we may handle it correctly, but it should never
actually occur and we should definitely not be open coding dirtying
of transactions to create this state.

IOWs, the transaction modification error handling paths already do
the right thing according to the state carried by the transaction at
the time the error was encountered.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx