Re: [PATCH] xfs: Fix false positive lockdep warning with sb_internal & fs_reclaim

From: Dave Chinner
Date: Mon Jan 06 2020 - 16:01:35 EST


On Mon, Jan 06, 2020 at 11:12:32AM -0500, Waiman Long wrote:
> On 1/3/20 9:36 PM, Dave Chinner wrote:
> > On Thu, Jan 02, 2020 at 10:52:08AM -0500, Waiman Long wrote:
> >> Depending on the workloads, the following circular locking dependency
> >> warning between sb_internal (a percpu rwsem) and fs_reclaim (a pseudo
> >> lock) may show up:
....
> >> IOWs, this is a false positive, caused by the fact that
> >> xfs_trans_alloc() is called from both above and below memory reclaim
> >> as well as within /every level/ of freeze processing. Lockdep is
> >> unable to describe the staged flush logic in the freeze process that
> >> prevents deadlocks from occurring, and hence we will pretty much
> >> always see false positives in the freeze path....
> >>
> >> Perhaps breaking the fs_reclaim pseudo lock into a per filesystem lock
> >> may fix the issue. However, that will greatly complicate the logic and
> >> may not be worth it.
> > ANd it won't work, because now we'll just get lockedp warnings on
> > the per-fs reclaim lockdep context.
>
> It may or may not work depending on how we are going to break it up. I
> haven't thought through that alternative yet as I am expecting that it
> will be a bigger change if we decide to go this route.

A per-filesystem lock will not work because a single XFS filesystem
can trigger this "locking" inversion -both contexts that lockdep
warns about occur in the normal operation of that single filesystem.

The only way to avoid this is to have multiple context specific
reclaim lockdep contexts per filesystem, and that becomes a mess
really quickly. The biggest problem with this is that the "lock
context" is no longer validated consistently across the entire
filesystem - we con only internally validate the order or each lock
context against itself, and not against operations in the other lock
contexts. Hence there's no global validation of lock orders -
lockdep allows different lock orders in different contexts and so
that defeats the purpose of using lockdep for this validation.

Indeed, we've been down this path before with lockdep for XFS inode
locking vs inode reclaim(*), and we removed it years ago because
multiple lock contexts for different operations and/or life-cycle
stages just hasn't been reliable or maintainable. We still get false
positives because we haven't solved the "lockdep can't express the
dependencies fully" problem, yet we've reduced the lock order
validation scope of lockdep considerably....

> >> Another way to fix it is to disable the taking of the fs_reclaim
> >> pseudo lock when in the freezing code path as a reclaim on the freezed
> >> filesystem is not possible as stated above. This patch takes this
> >> approach by setting the __GFP_NOLOCKDEP flag in the slab memory
> >> allocation calls when the filesystem has been freezed.
> > IOWs, "fix" it by stating that "lockdep can't track freeze
> > dependencies correctly"?
> The lockdep code has a singular focus on spotting possible deadlock
> scenarios from a locking point of view.

IMO, lockdep only works for very simplistic locking strategies.
Anything complex requires more work to get lockdep annotations
"correct enough" to prevent false positives than it does to actually
read the code and very the locking is correct.

Have you noticed we do runs of nested trylock-and-back-out-on-
failure because we lock objects in an order that might deadlock
because of cross-object state dependencies that can't be covered by
lockdep? e.g. xfs_lock_inodes() which nests up to 5 inodes deep,
can nest 3 separate locks per inode and has to take into account
journal flushing depenedencies when locking multiple inodes?

Lockdep is very simplisitic and the complex annotations we need to
handle situations like the above are very difficult to design,
use and maintainr. It's so much simpler just to use big hammers like
GFP_NOFS to shut up all the different types of false positives
lockdep throws up for reclaim context false positives because after
all these years there still isn't a viable mechanism to easily
express this sort of complex dependency chain.

> The freeze dependency has to be
> properly translated into appropriate locking sequences to make lockdep
> work correctly.i

This is part of the problem - freeze context is not actually a lock
but it's close enough that freezing on simple filesystems can be
validated with lockdep annotations. i.e. same basic concept as the
lockdep reclaim context. However, it just doesn't work reliably for
more complex subsystems where there are much more subtle and complex
behavioural interactions and dependencies that a single lock context
cannot express or be annotated to express. That's where lockdep
falls down....

> I would say that the use of a global fs_reclaim pseudo
> lock is not a perfect translation and so it has exception cases we need
> to handle.

Exactly the problem.

> > Nope. We are getting rid of kmem_alloc wrappers and all the
> > associated flags, not adding new ones. Part of that process is
> > identifying all the places we currently use KM_NOFS to "shut up
> > lockdep" and converting them to explicit __GFP_NOLOCKDEP flags.
> >
> > So right now, this change needs to be queued up behind the API
> > changes that are currently in progress...
>
> That is fine. I can wait and post a revised patch after that. Who is
> going to make these changes?

https://lore.kernel.org/linux-xfs/20191120104425.407213-1-cmaiolino@xxxxxxxxxx/

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx