Re: [PATCH v4 19/19] fs: handle inode->i_version more efficiently

From: Jeff Layton
Date: Mon Jan 08 2018 - 08:29:42 EST


On Mon, 2018-01-08 at 14:21 +0100, Krzysztof Kozlowski wrote:
> On Mon, Jan 8, 2018 at 1:56 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Sun, 2018-01-07 at 13:44 +0100, Krzysztof Kozlowski wrote:
> > > On Fri, Dec 22, 2017 at 1:05 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > From: Jeff Layton <jlayton@xxxxxxxxxx>
> > > >
> > > > Since i_version is mostly treated as an opaque value, we can exploit that
> > > > fact to avoid incrementing it when no one is watching. With that change,
> > > > we can avoid incrementing the counter on writes, unless someone has
> > > > queried for it since it was last incremented. If the a/c/mtime don't
> > > > change, and the i_version hasn't changed, then there's no need to dirty
> > > > the inode metadata on a write.
> > > >
> > > > Convert the i_version counter to an atomic64_t, and use the lowest order
> > > > bit to hold a flag that will tell whether anyone has queried the value
> > > > since it was last incremented.
> > > >
> > > > When we go to maybe increment it, we fetch the value and check the flag
> > > > bit. If it's clear then we don't need to do anything if the update
> > > > isn't being forced.
> > > >
> > > > If we do need to update, then we increment the counter by 2, and clear
> > > > the flag bit, and then use a CAS op to swap it into place. If that
> > > > works, we return true. If it doesn't then do it again with the value
> > > > that we fetch from the CAS operation.
> > > >
> > > > On the query side, if the flag is already set, then we just shift the
> > > > value down by 1 bit and return it. Otherwise, we set the flag in our
> > > > on-stack value and again use cmpxchg to swap it into place if it hasn't
> > > > changed. If it has, then we use the value from the cmpxchg as the new
> > > > "old" value and try again.
> > > >
> > > > This method allows us to avoid incrementing the counter on writes (and
> > > > dirtying the metadata) under typical workloads. We only need to increment
> > > > if it has been queried since it was last changed.
> > > >
> > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > ---
> > > > include/linux/fs.h | 2 +-
> > > > include/linux/iversion.h | 208 ++++++++++++++++++++++++++++++++++-------------
> > > > 2 files changed, 154 insertions(+), 56 deletions(-)
> > > >
> > >
> > > Hi,
> > >
> > > On recent linux-next my ARM/Exynos boards fail to boot over nfsroot.
> > > This commit popped up through bisect (log at the end). Systemd
> > > timeouts on some device-specific services, including mounting ext4
> > > /home:
> > >
> > > [ *** ] (1 of 4) A start job is running forâress polling (1min 41s / no limit)
> > > [ TIME ] Timed out waiting for device dev-ttySAC2.device.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for Serial Getty on ttySAC2.
> > > Jan 07 13:29:38 [ TIME ] Timed out waiting for device
> > > dev-disk-by\x2dlabel-home.device.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for /home.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for Local File Systems.
> > > Jan 07 13:29:38 [DEPEND] Dependency failed for File System Check on
> > > /dev/disk/by-label/home.
> > > Jan 07 13:30:02 [ *** ] (1 of 2) A start job is running forâress
> > > polling (1min 53s / no limit)
> > >
> > > Kernel command line:
> > > console=tty1 console=ttySAC2,115200n8
> > > ip=192.168.1.11:192.168.1.10:192.168.1.1:255.255.255.0::eth0:none
> > > nfsrootdebug root=/dev/nfs
> > > nfsroot=192.168.1.10:/srv/nfs/odroidxu3,vers=4,nolock rootwait rw
> > > smsc95xx.macaddr=00:1e:06:61:7a:93 no_console_suspend
> > >
> > > /home is /dev/mmcblk1p2:
> > > kozik@odroidxu3:~$ tune2fs -l /dev/mmcblk1p2
> > > tune2fs 1.43.7 (16-Oct-2017)
> > > Filesystem volume name: home
> > > Last mounted on: /home
> > > Filesystem UUID: 3f9dbeba-2738-45d3-807e-c1b2e21128ed
> > > Filesystem magic number: 0xEF53
> > > Filesystem revision #: 1 (dynamic)
> > > Filesystem features: has_journal ext_attr resize_inode dir_index
> > > filetype needs_recovery extent flex_bg sparse_super large_file
> > > uninit_bg dir_nlink extra_isize
> > > Filesystem flags: signed_directory_hash
> > > Default mount options: user_xattr acl
> > > Filesystem state: clean
> > > Errors behavior: Continue
> > > Filesystem OS type: Linux
> > > Inode count: 1430800
> > > Block count: 5717760
> > > Reserved block count: 285888
> > > Free blocks: 5467576
> > > Free inodes: 1428301
> > > First block: 0
> > > Block size: 4096
> > > Fragment size: 4096
> > > Reserved GDT blocks: 1022
> > > Blocks per group: 32768
> > > Fragments per group: 32768
> > > Inodes per group: 8176
> > > Inode blocks per group: 511
> > > Flex block group size: 16
> > > Filesystem created: Thu May 21 12:17:05 2015
> > > Last mount time: Thu Dec 21 13:31:26 2017
> > > Last write time: Thu Dec 21 13:31:26 2017
> > > Mount count: 1
> > > Maximum mount count: -1
> > > Last checked: Thu Dec 21 13:31:25 2017
> > > Check interval: 0 (<none>)
> > > Lifetime writes: 126 GB
> > > Reserved blocks uid: 0 (user root)
> > > Reserved blocks gid: 0 (group root)
> > > First inode: 11
> > > Inode size: 256
> > > Required extra isize: 28
> > > Desired extra isize: 28
> > > Journal inode: 8
> > > Default directory hash: half_md4
> > > Directory Hash Seed: 42e17e23-86b2-4356-ad63-78aa51651d03
> > > Journal backup: inode blocks
> > >
> > >
> > > Full dmesg log:
> > > http://www.krzk.eu/#/builders/1/builds/1258/steps/10/logs/serial0
> > >
> > > The regular boot from rootfs on SD card also fails - but without any
> > > serial console logs (just "Starting kernel...") so the real cause is
> > > unknown.
> > >
> > > Any hints?
> > >
> > > Best regards,
> > > Krzysztof
> > >
> > >
> > > bisect log:
> > > # bad: [73005e1a35fd67c644b0645c9e4c1efabd0fe62c] Add linux-next
> > > specific files for 20180103
> > > # good: [30a7acd573899fd8b8ac39236eff6468b195ac7d] Linux 4.15-rc6
> > > git bisect start 'next/master' 'next/stable'
> > > # bad: [c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a] Merge
> > > remote-tracking branch 'crypto/master'
> > > git bisect bad c1d290f9ce8daa2b0a79d2fe48c1b7c3c5370f5a
> > > # bad: [55695d94f0915121d106cd2d1ab94983a32f3e9a] Merge
> > > remote-tracking branch 'hid/for-next'
> > > git bisect bad 55695d94f0915121d106cd2d1ab94983a32f3e9a
> > > # good: [cffae1eead0dd91be1a3069a8348127bb00158f3] Merge
> > > remote-tracking branch 'realtek/for-next'
> > > git bisect good cffae1eead0dd91be1a3069a8348127bb00158f3
> > > # good: [5f889f1176dc99636c6bf8af7c286decc888c007] Merge
> > > remote-tracking branch 'btrfs/next'
> > > git bisect good 5f889f1176dc99636c6bf8af7c286decc888c007
> > > # good: [984c35877f36bee305e43a1c58176169854d85cf] Merge
> > > remote-tracking branch 'xfs/for-next'
> > > git bisect good 984c35877f36bee305e43a1c58176169854d85cf
> > > # bad: [f9fec502daea2a869232b6dff33ba3de79dd0d61] Merge
> > > remote-tracking branch 'printk/for-next'
> > > git bisect bad f9fec502daea2a869232b6dff33ba3de79dd0d61
> > > # good: [c71d227fc4133f949dae620ed5e3a250b43f2415] make kernel-side
> > > POLL... arch-independent
> > > git bisect good c71d227fc4133f949dae620ed5e3a250b43f2415
> > > # good: [416d20e8c31107f5dfd45d1d80d1e6c8e4871180] Merge branches
> > > 'work.get_user_pages_fast', 'work.wmci', 'work.sock_recvmsg',
> > > 'misc.netdrv', 'misc.poll', 'work.mqueue', 'work.whack-a-mole' and
> > > 'work.misc' into for-next
> > > git bisect good 416d20e8c31107f5dfd45d1d80d1e6c8e4871180
> > > # good: [325a1de4a691512a48c1426b943a7b0b9f8d6744] xfs: convert to new
> > > i_version API
> > > git bisect good 325a1de4a691512a48c1426b943a7b0b9f8d6744
> > > # good: [a94fe10fb114c169e7ddaecd8251521886409121] checkpatch: add
> > > pF/pf deprecation warning
> > > git bisect good a94fe10fb114c169e7ddaecd8251521886409121
> > > # good: [6b3911dffd1184fdcd63299a5fee59ac000f2067] btrfs: only dirty
> > > the inode in btrfs_update_time if something was changed
> > > git bisect good 6b3911dffd1184fdcd63299a5fee59ac000f2067
> > > # bad: [448f8c749a7a0ae03505823910ec45a112678048] Merge
> > > remote-tracking branch 'iversion/iversion-next'
> > > git bisect bad 448f8c749a7a0ae03505823910ec45a112678048
> > > # bad: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs: handle
> > > inode->i_version more efficiently
> > > git bisect bad 8618bff776758ebff5b55211e7b5a60a0fc119a5
> > > # first bad commit: [8618bff776758ebff5b55211e7b5a60a0fc119a5] fs:
> > > handle inode->i_version more efficiently
> >
> > That's really strange. I'm afraid I have no idea what could be going on.
> >
> > With NFS, we really just treat i_version as an opaque value, so I'm not
> > sure how this patch in particular would affect anything there. We _do_
> > increment it if you have a write delegation in some cases, but not many
> > servers hand those out.
>
> About the NFS server, Arch Linux on Raspberry Pi (so 32-bit, ARMv6):
> Linux pi 4.9.70-1-ARCH #1 SMP Mon Dec 18 19:38:00 UTC 2017 armv6l GNU/Linux
>
> The /etc/nfs.conf is default except:
> [nfsd]
> vers2=n
> vers3=n
>
> The client logs for nfsroot mounts are:
> Jan 08 14:07:25 :: running hook [net_nfs4]
> Jan 08 14:07:25 IP-Config: eth0 hardware address ba:17:70:7e:87:d1 mtu 1500
> Jan 08 14:07:25 IP-Config: eth0 guessed broadcast address 192.168.1.255
> Jan 08 14:07:25 IP-Config: eth0 complete (from 192.168.1.10):
> Jan 08 14:07:25 address: 192.168.1.11 broadcast: 192.168.1.255
> netmask: 255.255.255.0
> Jan 08 14:07:25 gateway: 192.168.1.1 dns0 : 0.0.0.0 dns1 : 0.0.0.0
> Jan 08 14:07:25 rootserver: 192.168.1.10 rootpath:
> Jan 08 14:07:25 filename :
> Jan 08 14:07:25 NFS-Mount: 192.168.1.10:/srv/nfs/odroidxu3
> Jan 08 14:07:25 Waiting 10 seconds for device /dev/nfs ...
> Jan 08 14:07:36 ERROR: device '/dev/nfs' not found. Skipping fsck.
> Jan 08 14:07:36 Mount cmd:
> Jan 08 14:07:36 /opt/tools/buildbot/arch-arm-bin/mount.nfs4 -o
> vers=4,nolock 192.168.1.10:/srv/nfs/odroidxu3 /new_root
>
> Only root (/) is froom NFS. The /home comes from sdcard (/dev/mmcblk1).
>
> > ext4 will only touch the i_version field if you mount it with '-o
> > iversion'. Are you doing that here?
>
> The /home is mounted by systemd from /etc/fstab:
> LABEL=home /home ext4 defaults 0 2
>
> >
> > Have you run the bisect more than once? Is this maybe an intermittent
> > problem, and the bisect has landed on the wrong commit?
>
> I just run tests on commits around again (but not full bisect) on
> current next (next-20180108):
> fbf97ece47d66d22 - works
> 3da7bcdd695bae43 ("fs: handle inode->i_version more efficiently") -
> fails: http://www.krzk.eu/#/builders/1/builds/1267
>
>
> Best regards,
> Krzysztof

Ok, thanks. If you're seeing hangs then that might imply that we have
some sort of excessive looping going on in the cmpxchg loops.

Could you apply the patch below and let me know if it causes either of
the warnings to pop? That might at least point us in the right
direction:

--------------------8<----------------------

[PATCH] DEBUG: throw warning if we are looping excessively in new
iversion code

Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
---
include/linux/iversion.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index 36ca56005c36..fe92bcc29f4c 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -168,6 +168,7 @@ static inline bool
inode_maybe_inc_iversion(struct inode *inode, bool force)
{
u64 cur, old, new;
+ int i = 0;

/*
* The i_version field is not strictly ordered with any other inode
@@ -193,6 +194,8 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
if (likely(old == cur))
break;
cur = old;
+ if (++i > 1000)
+ WARN_ONCE(1, "Too much looping!");
}
return true;
}
@@ -258,6 +261,7 @@ static inline u64
inode_query_iversion(struct inode *inode)
{
u64 cur, old, new;
+ int i = 0;

cur = inode_peek_iversion_raw(inode);
for (;;) {
@@ -277,6 +281,8 @@ inode_query_iversion(struct inode *inode)
if (likely(old == cur))
break;
cur = old;
+ if (++i > 1000)
+ WARN_ONCE(1, "Too much looping!");
}
return cur >> I_VERSION_QUERIED_SHIFT;
}
--
2.14.3