Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently

From: Jan Kara
Date: Wed Dec 20 2017 - 11:42:06 EST


On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > [PATCH] SQUASH: add memory barriers around i_version accesses
> >
> > Why explicit memory barriers rather than annotating the operations
> > with the required semantics and getting the barriers the arch
> > requires automatically? I suspect this should be using
> > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > the atomic_cmpxchg needs to have release semantics to match the
> > acquire semantics needed for the load of the current value.
> >
> > From include/linux/atomics.h:
> >
> > * For compound atomics performing both a load and a store, ACQUIRE
> > * semantics apply only to the load and RELEASE semantics only to the
> > * store portion of the operation. Note that a failed cmpxchg_acquire
> > * does -not- imply any memory ordering constraints.
> >
> > Memory barriers hurt my brain. :/
> >
> > At minimum, shouldn't the atomic op specific barriers be used rather
> > than full memory barriers? i.e:
> >
>
> They hurt my brain too. Yes, definitely atomic-specific barriers should
> be used here instead, since this is an atomic64_t now.
>
> After going over the docs again...my understanding has always been that
> you primarily need memory barriers to order accesses to different areas
> of memory.

That is correct.

> As Jan and I were discussing in the other thread, i_version is not
> synchronized with anything else. In this code, we're only dealing with a
> single 64-bit word. I don't think there are any races there wrt the API
> itself.

There are not but it is like saying that lock implementation is correct
because the lock state does not get corrupted ;). Who cares about protected
data...

> The "legacy" inode_inc_iversion() however _does_ have implied memory
> barriers from the i_lock. There could be some subtle de-facto ordering
> there, so I think we probably do want some barriers in here if only to
> preserve that. It's not likely to cost much, and may save us tracking
> down some fiddly bugs.
>
> What about this patch? Note that I've only added barriers to
> inode_maybe_inc_iversion. I don't see that we need it for the other
> functions, but please do tell me if I'm wrong there:
>
> --------------------8<---------------------------
>
> [PATCH] SQUASH: add memory barriers around i_version accesses
>
> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> ---
> include/linux/iversion.h | 53 +++++++++++++++++++++++++++++-------------------
> 1 file changed, 32 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..02187a3bec3b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
> atomic64_set(&inode->i_version, val);
> }
>
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return atomic64_read(&inode->i_version);
> +}
> +
> /**
> * inode_set_iversion - set i_version to a particular value
> * @inode: inode to set
> @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> {
> u64 cur, old, new;
>
> - cur = (u64)atomic64_read(&inode->i_version);
> + /*
> + * The i_version field is not strictly ordered with any other inode
> + * information, but the legacy inode_inc_iversion code used a spinlock
> + * to serialize increments.
> + *
> + * This code adds full memory barriers to ensure that any de-facto
> + * ordering with other info is preserved.
> + */
> + smp_mb__before_atomic();

This should be just smp_mb(). __before_atomic() pairs with atomic
operations like atomic_inc(). atomic_read() is completely unordered
operation (happens to be plain memory read on x86) and so __before_atomic()
is not enough.

> + cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is clear then we needn't do anything */
> if (!force && !(cur & I_VERSION_QUERIED))
> @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>
> old = atomic64_cmpxchg(&inode->i_version, cur, new);
> - if (likely(old == cur))
> + if (likely(old == cur)) {
> + smp_mb__after_atomic();

I don't think you need this. Cmpxchg is guaranteed to be full memory
barrier - from Documentation/atomic_t.txt:
- RMW operations that have a return value are fully ordered;

> break;
> + }
> cur = old;
> }
> return true;

...

> @@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
> {
> u64 cur, old, new;
>
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED)

And here I'd expect smp_mb() after inode_peek_iversion_raw() (actually be
needed only if you are not going to do cmpxchg as that implies barrier as
well). "Safe" use of i_version would be:

Update:

modify inode
inode_maybe_inc_iversion(inode)

Read:

my_version = inode_query_iversion(inode)
get inode data

And you need to make sure 'get inode data' does not get speculatively
evaluated before you actually sample i_version so that you are guaranteed
that if data changes, you will observe larger i_version in the future.

Also please add a comment smp_mb() in inode_maybe_inc_iversion() like:

/* This barrier pairs with the barrier in inode_query_iversion() */

and a similar comment to inode_query_iversion(). Because memory barriers
make sense only in pairs (see SMP BARRIER PAIRING in
Documentation/memory-barriers.txt).

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR