Re: NFS Caching broken in 4.19.37

From: Anton Ivanov
Date: Fri Feb 26 2021 - 10:41:42 EST


On 26/02/2021 15:03, Timo Rothenpieler wrote:
I think I can reproduce this, or at least something that looks very similar, on 5.10, namely 5.10.17 (on both client and server).

I think this is a different issue - see below.


We are running Slurm, and for a while now the logs it writes have been truncated, but only while they are being observed on the client with tail -f or something similar. This coincides with updating from 5.4 to 5.10, but a whole bunch of other stuff was updated at the same time, so it took me a while to correlate the two.

It then looks like this:

On Server:
store01 /srv/export/home/users/timo/TestRun # ls -l slurm-41101.out
-rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
store01 /srv/export/home/users/timo/TestRun # wc -l slurm-41101.out
61 slurm-41101.out

On Client:
timo@login01 ~/TestRun $ ls -l slurm-41101.out
-rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
timo@login01 ~/TestRun $ wc -l slurm-41101.out
24 slurm-41101.out

See https://gist.github.com/BtbN/b9eb4fc08ccc53bb20087bce0bf9f826 for the respective file-contents.
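The listings above show the same size from ls -l on both sides but a different line count from wc -l. A minimal sketch of the same comparison in Python follows. The helper names describe_file and looks_stale are made up for illustration, and the sketch assumes it is run once on the server and once on the client, with the two results compared by hand:

```python
import os

def describe_file(path):
    """Return (size reported by stat, bytes actually read, newline count)."""
    stat_size = os.stat(path).st_size
    with open(path, "rb") as f:
        data = f.read()
    return stat_size, len(data), data.count(b"\n")

def looks_stale(client_view, server_view):
    """True if both sides report the same stat size but differ in what was read.

    Each argument is a tuple as returned by describe_file().
    """
    same_size = client_view[0] == server_view[0]
    same_content_stats = client_view[1:] == server_view[1:]
    return same_size and not same_content_stats
```

This only compares counts, not contents, so it can flag a mismatch but cannot say which side is right.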

If I run the same test job, wait until it's done, and then look at its slurm.out file, it matches between NFS client and server.
If I tail -f the slurm.out on an NFS client, the file stops getting updated on the client, but keeps getting more logs written to it on the NFS server.

The slurm.out file is being written to by another NFS client, which is running on one of the compute nodes of the system. It's being read from a login node.

If these are two different clients, then what you see is possible on NFS with client-side caching. If you have multiple clients reading and writing the same files, you usually need to tune the caching options and/or use locking. I suspect that if you leave it for a while (until the cache expires) it will sort itself out.
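As a rough illustration of the locking approach, here is a minimal sketch of a reader that takes a shared POSIX lock before reading. The function name read_under_shared_lock is hypothetical. The sketch assumes the Linux NFS client revalidates its cached data when a lock is acquired, which is how nfs(5) describes file locking, and it does nothing about the writer, which would still need to flush or close the file for the data to reach the server. On the tuning side, nfs(5) documents mount options such as noac, actimeo= and lookupcache=; whether any of them helps here is untested.

```python
import fcntl

def read_under_shared_lock(path):
    """Read a whole file while holding a shared (read) POSIX lock on it."""
    with open(path, "rb") as f:
        # Blocks until the shared lock is granted.
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            # Release explicitly; closing the file would also drop the lock.
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A tail -f style observer does not take locks, so this would only apply to a reader you control.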

In my test case there is just one client. It missed a file deletion, and nothing short of an unmount and remount fixes that. I have waited for 30+ minutes; it does not seem to refresh or expire. I also see the opposite version pattern: the bug shows up on 4.x and up to at least 5.4, and I do not see it on 5.10.

Brgds,

Timo


On 21.02.2021 16:53, Anton Ivanov wrote:
Client side. This seems to be an entirely client side issue.

I have tested a variety of client kernels, from 4.9 up to 5.10, against 4.19 servers. Earlier I also observed it with a 4.9 client against a 4.9 server.

4.9 fails, 4.19 fails, 5.2 fails, 5.4 fails, 5.10 works.

At present the server is at 4.19.67 in all tests.

Linux jain 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

I can set up a couple of alternative servers during the week, but so far everything points towards a client fs cache issue, not a server one.

Brgds,

--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/