Re: new_inode_pseudo vs locked inode->i_state = 0

From: Mateusz Guzik
Date: Wed Aug 09 2023 - 05:11:54 EST


On 8/9/23, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Wed, Aug 09, 2023 at 02:23:59AM +0200, Mateusz Guzik wrote:
>> On 8/9/23, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
>> >> Hello,
>> >>
>> >> new_inode_pseudo is:
>> >> struct inode *inode = alloc_inode(sb);
>> >>
>> >> if (inode) {
>> >> spin_lock(&inode->i_lock);
>> >> inode->i_state = 0;
>> >> spin_unlock(&inode->i_lock);
>> >> }
>> >>
>> >> I'm trying to understand:
>> >> 1. why is it zeroing i_state (as opposed to have it happen in
>> >> inode_init_always)
>> >> 2. why is zeroing taking place with i_lock held
>> >>
>> >> The inode is freshly allocated, not yet added to the hash -- I would
>> >> expect that nobody else can see it.
>> >
>> > Maybe not at this point, but as soon as the function returns with
>> > the new inode, it could be published in some list that can be
>> > accessed concurrently and then the i_state visible on other CPUs
>> > better be correct.
>> >
>> > I'll come back to this, because the answer lies in this code:
>> >
>> >> Moreover, another consumer of alloc_inode zeroes without bothering to
>> >> lock -- see iget5_locked:
>> >> [snip]
>> >> struct inode *new = alloc_inode(sb);
>> >>
>> >> if (new) {
>> >> new->i_state = 0;
>> >> [/snip]
>> >
>> > Yes, that one is fine because the inode has not been published yet.
>> > The actual i_state serialisation needed to publish the inode happens
>> > in the function called in the very next line - inode_insert5().
>> >
>> > That does:
>> >
>> > spin_lock(&inode_hash_lock);
>> >
>> > .....
>> > /*
>> > * Return the locked inode with I_NEW set, the
>> > * caller is responsible for filling in the contents
>> > */
>> > spin_lock(&inode->i_lock);
>> > inode->i_state |= I_NEW;
>> > hlist_add_head_rcu(&inode->i_hash, head);
>> > spin_unlock(&inode->i_lock);
>> > .....
>> >
>> > spin_unlock(&inode_hash_lock);
>> >
>> > The i_lock is held across the inode state initialisation and hash
>> > list insert so that if anything finds the inode in the hash
>> > immediately after insert, they should set an initialised value.
>> >
>> > Don't be fooled by the inode_hash_lock here. We have
>> > find_inode_rcu() which walks hash lists without holding the hash
>> > lock, hence if anything needs to do a state check on the found
>> > inode, they are guaranteed to see I_NEW after grabbing the i_lock....
>> >
>> > Further, inode_insert5() adds the inode to the superblock inode
>> > list, which means concurrent sb inode list walkers can also see this
>> > inode whilst the inode_hash_lock is still held by inode_insert5().
>> > Those inode list walkers *must* see I_NEW at this point, and they
>> > are guaranteed to do so by taking i_lock before checking i_state....
>> >
>> > IOWs, the initialisation of inode->i_state for normal inodes must be
>> > done under i_lock so that lookups that occur after hash/sb list
>> > insert are guaranteed to see the correct value.
>> >
>> > If we now go back to new_inode_pseudo(), we see one of the callers
>> > is new_inode(), and it does this:
>> >
>> > struct inode *new_inode(struct super_block *sb)
>> > {
>> > struct inode *inode;
>> >
>> > spin_lock_prefetch(&sb->s_inode_list_lock);
>> >
>> > inode = new_inode_pseudo(sb);
>> > if (inode)
>> > inode_sb_list_add(inode);
>> > return inode;
>> > }
>> >
>> > IOWs, the inode is immediately published on the superblock inode
>> > list, and so inode list walkers can see it immediately. As per
>> > inode_insert5(), this requires the inode state to be fully
>> > initialised and memory barriers in place such that any walker will
>> > see the correct value of i_state. The simplest, safest way to do
>> > this is to initialise i_state under the i_lock....
>> >
>>
>> Thanks for the detailed answer, I do think you have a valid point but
>> I don't think it works with the given example. ;)
>>
>> inode_sb_list_add is:
>> spin_lock(&inode->i_sb->s_inode_list_lock);
>> list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
>> spin_unlock(&inode->i_sb->s_inode_list_lock);
>>
>> ... thus i_state is published by the time it unlocks.
>>
>> According to my grep all iterations over the list hold the
>> s_inode_list_lock, thus they are guaranteed to see the update, making
>> the release fence in new_inode_pseudo redundant for this case.
>
> I don't believe that is the case - the i_state modification is not
> within the critical region the s_inode_list_lock covers, nor is the
> cacheline i_state lies on referenced within the critical section.
> Hence there is no explicit ordering dependency created by
> inode_sb_list_add on the value of i_state.
>
> Your argument seems to be that we can rely on the side effect of
> some unrelated lock to provide ordered memory access - that's just
> really poor locking design and will result in unmaintainable code
> that gets broken without realising in the future.
>

In that spirit, *normally* when you add an object to some
list/whatever, there are probably other changes you made past i_state
= 0 (i.e. another fence is ultimately needed just before insertion).

Least error prone approach as far as I'm concerned would make sure the
inserting routine issues relevant fences -- that way consumers which
sneak in some updates prior to calling it are still covered just fine
and so happens the sb list iteration *is* covered in this manner. But
I guess that's not the expected way here.

I feel compelled to note that not issuing the fence in iget5_locked
and instead relying on hash insertion to sort it out is implementing
this approach, albeit internally it is explicitly done with i_lock.

> That's why the spin lock. It's *obviously correct*, and it doesn't
> require any of the other code that checks i_state to have to care
> about any lock or memory barrier mechanism other than taking
> i_lock...
>

It is obvious not *wrong* to do in sense of not breaking code, just
deeply confusing for the reader -- I don't know about you, if I see a
lock taken and released for what should be an object invisible to
other threads, immediately the question pops up if the inode *is* in
fact visible somewhere. Making sure i_state update is published by
starting with acquiring the lock on freshly allocated inode is not
something I thought of.

>> With this in mind I'm assuming the fence was there as a safety
>> measure, for consumers which would maybe need it.
>>
>> Then the code can:
>> struct inode *inode = alloc_inode(sb);
>>
>> if (inode) {
>> inode->i_state = 0;
>> /* make sure i_state update will be visible before we
>> insert
>> * the inode anywhere */
>> smp_wmb();
>> }
>
> AFAIA, that doesn't work by itself without a matching smp_rmb()
> prior to the i_state reader - memory barriers need to be paired for
> ordering to be valid. Hence this also seems to assume that we can
> rely on some other unrelated lock pairing to actually order memory
> accesses to i_state....
>

It has to be either smp_rmb or a consume barrier.

Spelling it out instead of having it implicitly done through
spin_unlock points out that it is redundant for the new_inode vs sb
allocation case, people just don't have strong reaction to plain
lock/unlock trips, whereas they get very worried about any explicitly
mentioned barriers (most justified for the latter, but they also
should be this way for the former).

>> Upshots:
>> - replaces 2 atomics with a mere release fence, which is way cheaper
>> to do everywhere and virtually free on x86-64
>> - people reading the code don't wonder who on earth are we locking
>> against
>
> Downsides:
> - memory barriers are hard to get right,
> - nobody will be able to look at the code and say "this is obviously
> correct".
> - random unpaired memory barriers in code end up making it
> unmaintainable.
> - impossible to test for correctness
>
[snip]
>> I am however going to /strongly suggest/ that a comment explaining
>> what's going on is added there, if the current state is to remain.
>
> If you really think that improves the code, send a patch....
>
>> As far as I'm concerned *locking* when a mere smp_wmb would sufficne
>> is heavily misleading and should be whacked if only for that reason.
>
> If it ain't broke, don't fix it.
>
> The code we have now is widely used (there are over a hundred
> callers of new_inode()) and we know it works correctly. Nobody is
> complaining that it is too slow, and generally speaking the overhead
> of this lock traversal is lost in the noise of all the other
> operations needed to be performed to initialise a new inode.
>

Maybe you have been around this code too long to see an outsider
perspective or maybe I did not do a good job spelling it out.

I'm not complaining about speed, I'm complaining that a standalone
lock trip there is *confusing*, especially when another consumer in
the same file does not do it and filesystems have *custom* allocation
routines -- is the inode visible to other threads?

I also noted the reasoning you initially gave is trivially satisified
with smp_wmb, but I'm definitely not going to push for it.

However, after more poking around, what I think what you were trying
to say is that whatever inode lookups/traversal always take the inode
lock to inspect i_state, thus zeroing enclosed by it makes it easier
to reason about it (but if that's the case why is iget5_locked
standing out and you seem fine with it?). Now that I justification I
can understand at least.

All that said, how about this (modulo wording):
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..977e72942706 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1018,6 +1018,13 @@ struct inode *new_inode_pseudo(struct super_block *sb)
struct inode *inode = alloc_inode(sb);

if (inode) {
+ /*
+ * Make sure i_state update is visible before this inode gets
+ * inserted anywhere.
+ *
+ * Safe way for consumers to inspect the field is with i_lock
+ * held, match this for simplicity.
+ */
spin_lock(&inode->i_lock);
inode->i_state = 0;
spin_unlock(&inode->i_lock);
@@ -1285,6 +1292,11 @@ struct inode *iget5_locked(struct super_block
*sb, unsigned long hashval,
struct inode *new = alloc_inode(sb);

if (new) {
+ /*
+ * new_inode_pseudo takes a lock to zero i_state, here
+ * we safely skip it because hash insertion will take
+ * care of it
+ */
new->i_state = 0;
inode = inode_insert5(new, hashval, test, set, data);
if (unlikely(inode != new))


or patch iget5_locked to use new_inode_pseudo, but then I don't want
to be on a lookout for someone's microbench changing

--
Mateusz Guzik <mjguzik gmail.com>