Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

From: Stephen Brennan
Date: Wed Feb 16 2022 - 02:44:03 EST


Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
> On Wed, Feb 16, 2022 at 03:27:39AM +0000, Al Viro wrote:
>> On Tue, Feb 15, 2022 at 06:24:53PM -0800, Stephen Brennan wrote:
>>
>> > It seems to me that, if we had taken a reference on child by
>> > incrementing the reference count prior to unlocking it, then
>> > dentry_unlist could never have been called, since we would never have
>> > made it into __dentry_kill. child would still be on the list, and any
>> > cursor (or sweep_negative) list updates would now be reflected in
>> > child->d_child.next. But dput is definitely not safe while holding a
>> > lock on a parent dentry (even more so now thanks to my patch), so that
>> > is out of the question.
>> >
>> > Would dput_to_list be an appropriate solution to that issue? We can
>> > maintain a dispose list in d_walk and then for any dput which really
>> > drops the refcount to 0, we can handle them after d_walk is done. It
>> > shouldn't be that many dentries anyway.
>>
>> Interesting idea, but... what happens to behaviour of e.g.
>> shrink_dcache_parent()? You'd obviously need to modify the test in
>> select_collect(), but then the selected dentries become likely candidates
>> for d_walk() itself wanting to move them over to its internal shrink list.
>> OTOH, __dput_to_list() will just decrement the count and skip the sucker
>> if it's already on a shrink list...
>>
>> It might work, but it really needs a careful analysis wrt.
>> parallel d_walk(). What happens when you have two threads hitting
>> shrink_dcache_parent() on two different places, one being an ancestor
>> of another? That can happen in parallel, and currently it does work
>> correctly, but that's fairly delicate and there are places where a minor
>> change could turn O(n) into O(n^2), etc.
>>
>> Let me think about that - I'm not saying it's hopeless, and it
>> would be nice to avoid that subtlety in dentry_unlist(), but there
>> might be dragons.
>
> PS: another obvious change is that d_walk() would become blocking.
> So e.g.
>
> int path_has_submounts(const struct path *parent)
> {
>         struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };
>
>         read_seqlock_excl(&mount_lock);
>         d_walk(parent->dentry, &data, path_check_mount);
>         read_sequnlock_excl(&mount_lock);
>
>         return data.mounted;
> }
>
> would need a rework - d_walk() is under a spinlock here. Another
> potential headache in that respect is d_genocide() - currently non-blocking,
> with this change extremely likely to do evictions. That, however, is
> not a problem for current in-tree callers - they are all shortly followed
> by shrink_dcache_parent() or equivalents.
>
> path_has_submounts(), though... I'd really hate to reintroduce the
> "call this on entry/call this on exit" callbacks. Perhaps it would
> be better to pass the dispose list to d_walk() and have the callers
> deal with evictions? For that matter, shrink_dcache_parent() and
> friends would be just fine passing the same list they are collecting
> into.
>
> <looks at path_has_submounts() callers>
> *growl*
> autofs_d_automount() has it called under sbi->fs_lock. So we'd need
> to take the disposal all the way out there, and export shrink_dentry_list()
> while we are at it. Not pretty ;-/
>
> And no, we can't make the disposal async, so offloading it to a worker or
> thread is not feasible...

Hm, making d_walk blocking seems like a real barrier for callers that
use it as a lightweight check (e.g. path_has_submounts) and aren't
sensitive to retries. Other callers like shrink_dcache_parent are fine
with blocking and already have a dispose list, but they want to avoid
retries.

What if we gave d_walk an optional "dispose" list parameter? Callers
that provide it get a reference taken on each child during the ascent,
ensuring the child can't get killed out from under us. Callers that
don't provide one are forced, if a child does get killed, to retry from
the beginning of this_parent->d_subdirs. This wouldn't eliminate the
careful analysis needed for parallel d_walk when two walkers both
provide dispose lists, but it would avoid making every d_walk caller
pay the blocking penalty.

Here's a super rough sketch of what I mean (it's late at night, no
obvious syntax errors but it won't compile unless all the callers are
updated):

diff --git a/fs/dcache.c b/fs/dcache.c
index e98079ed86be..307d32f023c8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1408,11 +1408,13 @@ enum d_walk_ret {
  * @enter: callback when first entering the dentry
+ * @dispose: optional list to collect dentries for the caller to evict
  *
  * The @enter() callbacks are called with d_lock held.
  */
 static void d_walk(struct dentry *parent, void *data,
-                   enum d_walk_ret (*enter)(void *, struct dentry *))
+                   enum d_walk_ret (*enter)(void *, struct dentry *),
+                   struct list_head *dispose)
 {
         struct dentry *this_parent;
         struct list_head *next;
         unsigned seq = 0;
         enum d_walk_ret ret;
@@ -1483,23 +1484,31 @@ static void d_walk(struct dentry *parent, void *data,
 ascend:
         if (this_parent != parent) {
                 struct dentry *child = this_parent;
                 this_parent = child->d_parent;

-                spin_unlock(&child->d_lock);
-                spin_lock(&this_parent->d_lock);
+                if (dispose) {
+                        /* pin child so it can't be killed while we drop
+                         * its lock to take the parent's */
+                        child->d_lockref.count += 1;
+                        spin_unlock(&child->d_lock);
+                        spin_lock(&this_parent->d_lock);
+                        dput_to_list(child, dispose);
+                } else {
+                        spin_unlock(&child->d_lock);
+                        spin_lock(&this_parent->d_lock);
+                }

                 /* might go back up the wrong parent if we have had a rename. */
                 if (need_seqretry(&rename_lock, seq))
                         goto rename_retry;
-                /* go into the first sibling still alive */
-                do {
-                        next = child->d_child.next;
-                        if (next == &this_parent->d_subdirs)
-                                goto ascend;
-                        child = list_entry(next, struct dentry, d_child);
-                } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
+
+                /* child was killed, its next pointers may now be stale, retry */
+                if (child->d_flags & DCACHE_DENTRY_KILLED)
+                        goto killed_retry;
+
+                next = child->d_child.next;
                 rcu_read_unlock();
                 goto resume;
         }
         if (need_seqretry(&rename_lock, seq))
                 goto rename_retry;
@@ -1516,10 +1524,20 @@ static void d_walk(struct dentry *parent, void *data,
         BUG_ON(seq & 1);
         if (!retry)
                 return;
         seq = 1;
         goto again;
+
+killed_retry:
+        rcu_read_unlock();
+        /* a dispose list pins each child across the ascend, so only
+         * callers without one can see a killed child here */
+        BUG_ON(dispose);
+        if (!retry)
+                goto out_unlock;
+        goto repeat;
+
 }

 struct check_mount {
         struct vfsmount *mnt;
         unsigned int mounted;
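
For the callers, the split would look roughly like this (again just a
sketch, assuming the d_walk() signature above). path_has_submounts()
passes NULL and keeps its current non-blocking behavior:

int path_has_submounts(const struct path *parent)
{
        struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };

        read_seqlock_excl(&mount_lock);
        /* no dispose list: non-blocking, but may retry on a killed child */
        d_walk(parent->dentry, &data, path_check_mount, NULL);
        read_sequnlock_excl(&mount_lock);

        return data.mounted;
}

while shrink_dcache_parent() can hand d_walk() the same list it already
collects into, as you suggested, so it never sees a killed-child retry
(abbreviated from the existing loop body):

        INIT_LIST_HEAD(&data.dispose);
        /* collect and dispose through the same list */
        d_walk(parent, &data, select_collect, &data.dispose);
        if (!list_empty(&data.dispose)) {
                shrink_dentry_list(&data.dispose);
                continue;
        }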