Re: [PATCH 1/2] kernfs: dont take i_lock on inode attr read

From: Ian Kent
Date: Thu Jul 27 2023 - 21:29:30 EST


On 28/7/23 09:06, Imran Khan wrote:
Hello Ian,

On 28/7/2023 10:16 am, Ian Kent wrote:
On 28/7/23 08:00, Ian Kent wrote:
On 27/7/23 12:30, Imran Khan wrote:
Hello Ian,
Sorry for late reply. I was about to reply this week.

On 27/7/2023 10:38 am, Ian Kent wrote:
On 20/7/23 10:03, Ian Kent wrote:
On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
[...]
I do see a problem with recent changes.

I'll send this off to Greg after I've done some testing (primarily just
compile and function).

Here's a patch which describes what I found.

Comments are, of course, welcome, ;)
Anders I was hoping you would check if/what lockdep trace

you get with this patch.


Imran, I was hoping you would comment on my change as it

relates to the kernfs_iattr_rwsem changes.


Ian

kernfs: fix missing kernfs_iattr_rwsem locking

From: Ian Kent <raven@xxxxxxxxxx>

When the kernfs_iattr_rwsem was introduced a case was missed.

The update of the kernfs directory node child count was also protected
by the kernfs_rwsem and needs to be included in the change so that the
child count (and so the inode n_link attribute) does not change while
holding the rwsem for read.

kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
these are getting invoked while adding (kernfs_add_one),
removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
operations proceed under kernfs_rwsem and I see each invocation of
kernfs_link/unlink_sibling during the above mentioned operations is happening
under kernfs_rwsem.
So the child count should still be protected by kernfs_rwsem and we should not
need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
Yes, that's exactly what I intended (assuming you mean write lock in those cases)

when I did it so now I wonder what I saw that lead to my patch, I'll need to look

again ...
Ahh, I see why I thought this ...

It's the hunk:

@@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
        kn = inode->i_private;
        root = kernfs_root(kn);

-       down_read(&root->kernfs_rwsem);
+       down_read(&root->kernfs_iattr_rwsem);
        kernfs_refresh_inode(kn, inode);
        ret = generic_permission(&nop_mnt_idmap, inode, mask);
-       up_read(&root->kernfs_rwsem);
+       up_read(&root->kernfs_iattr_rwsem);

        return ret;
 }

which takes away the kernfs_rwsem and introduces the possibility of

the change. It may be more instructive to add back taking the read

lock of kernfs_rwsem in .permission() than altering the sibling link

and unlink functions, I mean I even caught myself on it.

Yes this was the block I referred to in my second comment [1]. However adding
back read lock of kernfs_rwsem in .permission() will again introduce the
bottleneck that I mentioned at [2]. So I think taking taking the locks in
link/unlink is the better option.

Yes, sorry, I always fall into the trap of not reading through all my

mail before commenting, oops!


The fact that .permission() is called so much more than the create/remove

functions also occurred to be me too just after I posted my comment (and

is probably why I originally did it that way).


I'll forward the patch for merge but would really like to see the lockdep

trace so I'll wait a little while ...

I understand having one lock to synchronize everything makes it easier
debug/development wise but sometimes (such as the case mentioned in [2]), it is
not optimum performance wise.

Indeed, the performance improvement work that has been happening over

quite some time now is very good.


I had seen some opportunities around the file open path long ago but

hadn't got to doing anything there as you have, the work looks good

to me, thanks for doing it.


Ian

Thoughts ?

Thanks,
Imran

[1]: https://lore.kernel.org/all/8b0a1619-1e39-fc3a-1226-f3b167e64646@xxxxxxxxxx/
[2]: https://lore.kernel.org/all/20230302043203.1695051-2-imran.f.khan@xxxxxxxxxx/
Ian


Kindly let me know your thoughts. I would still like to see new lockdep traces
with this change.
Indeed, I hope Anders can find time to get the trace.


Ian

Thanks,
Imran

Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
attributes)

Signed-off-by: Ian Kent <raven@xxxxxxxxxx>
Cc: Anders Roxell <anders.roxell@xxxxxxxxxx>
Cc: Imran Khan <imran.f.khan@xxxxxxxxxx>
Cc: Arnd Bergmann <arnd@xxxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Eric Sandeen <sandeen@xxxxxxxxxxx>
---
   fs/kernfs/dir.c |    4 ++++
   1 file changed, 4 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 45b6919903e6..6e84bb69602e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
*kn)
       rb_insert_color(&kn->rb, &kn->parent->dir.children);
         /* successfully added, account subdir number */
+ down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
       if (kernfs_type(kn) == KERNFS_DIR)
           kn->parent->dir.subdirs++;
       kernfs_inc_rev(kn->parent);
+    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
         return 0;
   }
@@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
kernfs_node *kn)
       if (RB_EMPTY_NODE(&kn->rb))
           return false;
   + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
       if (kernfs_type(kn) == KERNFS_DIR)
           kn->parent->dir.subdirs--;
       kernfs_inc_rev(kn->parent);
+    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
         rb_erase(&kn->rb, &kn->parent->dir.children);
       RB_CLEAR_NODE(&kn->rb);