Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
From: Liam R. Howlett
Date: Thu Feb 15 2024 - 16:08:19 EST
* Jan Kara <jack@xxxxxxx> [240215 12:16]:
> On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > * Jan Kara <jack@xxxxxxx> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > >
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > >
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > >
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > >
> > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > >
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> >
> > If you are stopping, you should be pausing the iteration. Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue. We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
>
> I see. But we have versions of these structures that have locking external
> to the structure itself, don't we?
Ah, I do have them - but I don't want to propagate their use, as the
dream is that they can be removed.
> Then how do you imagine serializing the
> background operations like compaction? As much as I agree your argument is
> "theoretically clean", it seems a bit like a trap and there are definitely
> xarray users that are going to be broken by this (e.g.
> tag_pages_for_writeback())...
I'm not sure I follow the trap logic. The data structure has locking
rules that need to be followed for reading (rcu) and writing (spinlock
for the maple tree). If you don't correctly lock the data structure
then you really are setting yourself up for potential issues in the
future.
The documentation outlines the limitations on how and when to lock. I'm
not familiar with the xarray users, but the xarray does check for
locking with lockdep. The way this is written, though, bypasses the
lockdep checking because the locks are taken and dropped without the
proper scope.
If you feel like this is a trap, then maybe we need to figure out a new
plan to detect incorrect use?
Looking through tag_pages_for_writeback(), it does what is necessary to
keep a safe state - before it unlocks, it calls xas_pause(). We have the
same in the maple tree: mas_pause(). This makes the next operation
restart from the root of the tree (the root can also change), to ensure
that it is safe.
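To make the pause-before-unlock rule concrete, here is a minimal
userspace sketch. It is a toy model, not the maple tree or xarray API:
toy_tree, toy_state, toy_find(), toy_pause() and toy_insert() are all
hypothetical names, the "tree" is just a sorted array, and a generation
counter stands in for the node going stale. The assert() on the fast
path plays the role of the UAF you would risk by reusing cached state
across an unlock.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 8

/* Toy stand-in for the tree: sorted keys whose layout may change. */
struct toy_tree {
	unsigned long keys[TOY_MAX];
	size_t nr;
	unsigned gen;		/* bumped whenever data moves */
};

/* Toy stand-in for ma_state/xa_state (hypothetical, not the kernel type). */
struct toy_state {
	struct toy_tree *tree;
	unsigned long next;	/* lowest key still to be visited */
	size_t slot;		/* cached position, only valid for 'gen' */
	unsigned gen;
	int valid;		/* cleared by toy_pause() */
};

/* Drop the cached position so the next find re-walks from the "root". */
static void toy_pause(struct toy_state *s)
{
	s->valid = 0;
}

/* Any store moves data around, as compaction would. */
static void toy_insert(struct toy_tree *t, unsigned long key)
{
	size_t i = t->nr;

	assert(t->nr < TOY_MAX);
	while (i > 0 && t->keys[i - 1] > key) {
		t->keys[i] = t->keys[i - 1];
		i--;
	}
	t->keys[i] = key;
	t->nr++;
	t->gen++;
}

/* Store the next key <= max in *out; return 0 when there is none.
 * Toy limitation: a key of ULONG_MAX is not handled. */
static int toy_find(struct toy_state *s, unsigned long max, unsigned long *out)
{
	size_t i;

	if (s->valid) {
		/* Fast path trusts the cached slot: stale after a change. */
		assert(s->gen == s->tree->gen);
		i = s->slot + 1;
	} else {
		/* Slow path: search by key only, nothing cached is used. */
		for (i = 0; i < s->tree->nr; i++)
			if (s->tree->keys[i] >= s->next)
				break;
	}
	if (i >= s->tree->nr || s->tree->keys[i] > max)
		return 0;
	s->slot = i;
	s->gen = s->tree->gen;
	s->valid = 1;
	s->next = s->tree->keys[i] + 1;
	*out = s->tree->keys[i];
	return 1;
}

/* Walk the tree, "unlocking" after each entry; returns the sum of keys. */
static unsigned long toy_demo(void)
{
	struct toy_tree t = { .keys = { 10, 20, 30 }, .nr = 3 };
	struct toy_state s = { .tree = &t };
	unsigned long key, sum = 0;

	while (toy_find(&s, 100, &key)) {
		sum += key;
		toy_pause(&s);		/* always before "unlocking" */
		if (key == 10)
			toy_insert(&t, 5);	/* layout changes meanwhile */
	}
	return sum;
}
```

In this model, removing the toy_pause() call makes the fast-path
assert() fire as soon as the layout changes between two finds, which is
the class of bug being described.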
If you have other examples you think are unsafe then I can have a look
at them as well.
You can make the existing code safe by also calling xas_pause() before
the rcu lock is dropped, but that is essentially what Chuck has done in
the maple tree conversion by using mt_find().
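As a rough illustration of why the mt_find() approach is safe, here is
a small userspace sketch of a "fresh state per lookup" walk. It is only
a model under stated assumptions: toy_tree, toy_find_fresh() and
toy_count() are hypothetical names and the tree is a sorted array. The
point it shows is that the caller carries nothing but an index between
calls, so there is no cached node to go stale while the lock is
dropped.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 8

/* Toy stand-in for the offset tree: a sorted array of keys. */
struct toy_tree {
	unsigned long keys[TOY_MAX];
	size_t nr;
};

/* Find the first key in [*index, max], searching from the start every
 * time. On success store it in *out and advance *index past it.
 * Toy limitation: a key of ULONG_MAX is not handled. */
static int toy_find_fresh(const struct toy_tree *t, unsigned long *index,
			  unsigned long max, unsigned long *out)
{
	assert(t->nr <= TOY_MAX);
	for (size_t i = 0; i < t->nr; i++) {
		if (t->keys[i] < *index)
			continue;
		if (t->keys[i] > max)
			break;
		*out = t->keys[i];
		*index = t->keys[i] + 1;
		return 1;
	}
	return 0;
}

/* Count the entries up to max; only 'index' survives between lookups. */
static size_t toy_count(const struct toy_tree *t, unsigned long max)
{
	unsigned long index = 0, key;
	size_t n = 0;

	while (toy_find_fresh(t, &index, max, &key)) {
		/* Kernel code would pin the entry, drop the RCU read
		 * lock, emit it, and dput() here before looping. */
		n++;
	}
	return n;
}
```

The cost is a full search per entry instead of a step from a cached
position, which matches the "a little less efficient" note in the
patch description.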
Regarding compaction, I would expect the write lock to be taken to avoid
any writes happening while compaction occurs. Readers would use rcu
locking to ensure they return either the old or new value. During the
write critical section, other writers would be in a
"mas_pause()/xas_pause()" state - so once they continue, they will
re-start the walk to the next element in the new tree.
If external locks are used, then compaction would be sub-optimal, as it
may unnecessarily hold up readers, or the partial work a writer does
before the point of a store into the maple tree.
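The "old or new value" behaviour for readers can be sketched in
userspace with C11 atomics. This is a single-threaded toy, not RCU and
not the maple tree: toy_node, toy_compact() and toy_sum() are
hypothetical names, there is no writer lock, and the grace period
before reusing the old node is left out. It only shows the
copy-then-publish shape, where a reader loads the root once and works
on whichever version it got.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy node: a value of 0 marks an empty slot. */
struct toy_node {
	unsigned long vals[4];
	size_t nr;
};

/* Writer: build a compacted copy, then publish it with one store. */
static void toy_compact(_Atomic(struct toy_node *) *root,
			struct toy_node *fresh)
{
	struct toy_node *old = atomic_load_explicit(root, memory_order_relaxed);

	assert(fresh != old);	/* never compact in place under readers */
	fresh->nr = 0;
	for (size_t i = 0; i < old->nr; i++)
		if (old->vals[i])
			fresh->vals[fresh->nr++] = old->vals[i];
	atomic_store_explicit(root, fresh, memory_order_release);
	/* Real RCU would wait for a grace period before reusing 'old'. */
}

/* Reader: one acquire load, then only that snapshot is touched. */
static unsigned long toy_sum(_Atomic(struct toy_node *) *root)
{
	struct toy_node *n = atomic_load_explicit(root, memory_order_acquire);
	unsigned long sum = 0;

	for (size_t i = 0; i < n->nr; i++)
		sum += n->vals[i];
	return sum;
}

/* Returns 1 if a reader sees the same contents before and after. */
static int toy_demo(void)
{
	struct toy_node sparse = { { 1, 0, 3, 0 }, 4 }, dense;
	_Atomic(struct toy_node *) root;
	unsigned long before, after;

	atomic_init(&root, &sparse);
	before = toy_sum(&root);
	toy_compact(&root, &dense);
	after = toy_sum(&root);
	return before == after && atomic_load(&root) == &dense &&
	       dense.nr == 2;
}
```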
Thanks,
Liam