Re: new_inode_pseudo vs locked inode->i_state = 0

From: Mateusz Guzik
Date: Tue Aug 08 2023 - 20:24:04 EST


On 8/9/23, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
>> Hello,
>>
>> new_inode_pseudo is:
>> struct inode *inode = alloc_inode(sb);
>>
>> if (inode) {
>> 	spin_lock(&inode->i_lock);
>> 	inode->i_state = 0;
>> 	spin_unlock(&inode->i_lock);
>> }
>>
>> I'm trying to understand:
>> 1. why is it zeroing i_state (as opposed to have it happen in
>> inode_init_always)
>> 2. why is zeroing taking place with i_lock held
>>
>> The inode is freshly allocated, not yet added to the hash -- I would
>> expect that nobody else can see it.
>
> Maybe not at this point, but as soon as the function returns with
> the new inode, it could be published in some list that can be
> accessed concurrently and then the i_state visible on other CPUs
> better be correct.
>
> I'll come back to this, because the answer lies in this code:
>
>> Moreover, another consumer of alloc_inode zeroes without bothering to
>> lock -- see iget5_locked:
>> [snip]
>> struct inode *new = alloc_inode(sb);
>>
>> if (new) {
>> 	new->i_state = 0;
>> [/snip]
>
> Yes, that one is fine because the inode has not been published yet.
> The actual i_state serialisation needed to publish the inode happens
> in the function called in the very next line - inode_insert5().
>
> That does:
>
> spin_lock(&inode_hash_lock);
>
> .....
> /*
>  * Return the locked inode with I_NEW set, the
>  * caller is responsible for filling in the contents
>  */
> spin_lock(&inode->i_lock);
> inode->i_state |= I_NEW;
> hlist_add_head_rcu(&inode->i_hash, head);
> spin_unlock(&inode->i_lock);
> .....
>
> spin_unlock(&inode_hash_lock);
>
> The i_lock is held across the inode state initialisation and hash
> list insert so that if anything finds the inode in the hash
> immediately after insert, they should see an initialised value.
>
> Don't be fooled by the inode_hash_lock here. We have
> find_inode_rcu() which walks hash lists without holding the hash
> lock, hence if anything needs to do a state check on the found
> inode, they are guaranteed to see I_NEW after grabbing the i_lock....
>
> Further, inode_insert5() adds the inode to the superblock inode
> list, which means concurrent sb inode list walkers can also see this
> inode whilst the inode_hash_lock is still held by inode_insert5().
> Those inode list walkers *must* see I_NEW at this point, and they
> are guaranteed to do so by taking i_lock before checking i_state....
>
> IOWs, the initialisation of inode->i_state for normal inodes must be
> done under i_lock so that lookups that occur after hash/sb list
> insert are guaranteed to see the correct value.
>
> If we now go back to new_inode_pseudo(), we see one of the callers
> is new_inode(), and it does this:
>
> struct inode *new_inode(struct super_block *sb)
> {
> 	struct inode *inode;
>
> 	spin_lock_prefetch(&sb->s_inode_list_lock);
>
> 	inode = new_inode_pseudo(sb);
> 	if (inode)
> 		inode_sb_list_add(inode);
> 	return inode;
> }
>
> IOWs, the inode is immediately published on the superblock inode
> list, and so inode list walkers can see it immediately. As per
> inode_insert5(), this requires the inode state to be fully
> initialised and memory barriers in place such that any walker will
> see the correct value of i_state. The simplest, safest way to do
> this is to initialise i_state under the i_lock....
>

Thanks for the detailed answer. I do think you have a valid point, but
I don't think it works with the given example. ;)

inode_sb_list_add is:
spin_lock(&inode->i_sb->s_inode_list_lock);
list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
spin_unlock(&inode->i_sb->s_inode_list_lock);

... thus i_state is published by the time it unlocks.

According to my grep, all iterations over the list hold the
s_inode_list_lock, thus they are guaranteed to see the update, making
the release fence in new_inode_pseudo (the i_lock unlock) redundant for
this case.
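
To make the claim concrete, here is a minimal userspace sketch, assuming a
pthread mutex as a stand-in for s_inode_list_lock. The names sb_node,
sb_list_lock, sb_list_add and sb_list_count_dirty are hypothetical and do
not exist in the kernel. The state store is a plain write with no per-node
lock; the list lock alone orders it for any walker that takes the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct inode: a state word plus a list link. */
struct sb_node {
	unsigned long state;
	struct sb_node *next;
};

/* Hypothetical stand-ins for s_inode_list_lock and the s_inodes list. */
static pthread_mutex_t sb_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct sb_node *sb_list_head;

/* Writer: plain store to state, then publish under the list lock. */
static void sb_list_add(struct sb_node *node)
{
	assert(node != NULL);
	node->state = 0;	/* no per-node lock taken here */

	pthread_mutex_lock(&sb_list_lock);
	node->next = sb_list_head;
	sb_list_head = node;
	/* The unlock is a release: the state store is ordered before it. */
	pthread_mutex_unlock(&sb_list_lock);
}

/* Walker: takes the same lock, so it observes state as stored above. */
static int sb_list_count_dirty(void)
{
	int dirty = 0;

	pthread_mutex_lock(&sb_list_lock);
	for (struct sb_node *n = sb_list_head; n; n = n->next)
		if (n->state != 0)
			dirty++;
	pthread_mutex_unlock(&sb_list_lock);
	return dirty;
}
```

This only covers walkers that hold the list lock. A lockless walker would
not get this guarantee from the list lock alone.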

With this in mind I'm assuming the fence was there as a safety
measure, for consumers which would maybe need it.

Then the code can be:
struct inode *inode = alloc_inode(sb);

if (inode) {
	inode->i_state = 0;
	/* make sure i_state update will be visible before we insert
	 * the inode anywhere */
	smp_wmb();
}

Upshots:
- replaces 2 atomics with a mere release fence, which is way cheaper
to do everywhere and virtually free on x86-64
- people reading the code don't wonder who on earth we are locking against
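
For illustration only, a userspace analogue of the fence-based variant,
assuming C11 atomics. The names pub_node, pub_slot, pub_init_and_publish
and pub_lookup_state are hypothetical. atomic_thread_fence with
memory_order_release is a somewhat stronger stand-in for smp_wmb(), and the
sketch assumes the reader orders its own loads, which a write barrier by
itself does not provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct inode with a plain state field. */
struct pub_node {
	unsigned long state;
};

/* Hypothetical publication slot; stands in for "any list or hash". */
static _Atomic(struct pub_node *) pub_slot;

/* Writer: rough analogue of "i_state = 0; smp_wmb();" then publishing. */
static void pub_init_and_publish(struct pub_node *node)
{
	assert(node != NULL);
	node->state = 0;
	/* Order the state store before the pointer store below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&pub_slot, node, memory_order_relaxed);
}

/* Reader: the acquire load pairs with the writer's release fence. */
static long pub_lookup_state(void)
{
	struct pub_node *node =
		atomic_load_explicit(&pub_slot, memory_order_acquire);

	if (!node)
		return -1;	/* nothing published yet */
	return (long)node->state;
}
```

A reader that obtains the pointer through pub_lookup_state() sees state as
stored before the fence. Without the reader-side ordering the writer's fence
guarantees nothing on its own.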

All that said, if the (possibly redundant) fence is literally the only
reason for the lock trip, I would once more propose zeroing in
inode_init_always:
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..ce9664c4efe9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 		return -ENOMEM;
 	this_cpu_inc(nr_inodes);

+	inode->i_state = 0;
+	/*
+	 * Make sure i_state update is visible before this inode gets inserted
+	 * anywhere.
+	 */
+	smp_wmb();
+
 	return 0;
 }
 EXPORT_SYMBOL(inode_init_always);

This is more in the spirit of making sure everybody has published
i_state = 0, and it facilitates cleanup:
- new_inode_pseudo is now just alloc_inode
- confusing unlocked/unfenced i_state = 0 disappears from iget5_locked

And probably some more tidyups.

Now, I'm not going to get into a flame war with anyone over doing
smp_wmb instead of the lock trip (looks like a no-brainer to me, but I
got flamed for another one earlier today ;>).

I am however going to /strongly suggest/ that a comment explaining
what's going on is added there, if the current state is to remain.

As far as I'm concerned, *locking* when a mere smp_wmb would suffice
is heavily misleading and should be whacked if only for that reason.

Cheers,
--
Mateusz Guzik <mjguzik gmail.com>