Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process

From: NeilBrown
Date: Sun Nov 05 2023 - 16:55:48 EST


On Sun, 05 Nov 2023, Donald Buczek wrote:
....
>
> for task in /proc/*/task/*; do
> echo "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
> cmd cat $task/stack
> done
>
> which can further be reduced to
>
> for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
>
> This is absolutely reproducible. Above line unblocks the system reliably.
>
> Another remarkable thing: We've modified above code to do the
> processes slowly one by one and checking after each step if I/O
> resumed. And each time we've tested that, it was one of the 64 nfsd
> processes (but not the very first one tried). While the system
> exports filesystems, we have absolutely no reason to assume that any
> client actually tries to access this nfs server. Additionally, when
> the full script is run, the stack traces show all nfsd tasks in their
> normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
>
> Does anybody have an idea how a `cat /proc/PID/cmdline` on a specific
> assumed-to-be-idle nfsd thread could have such a "healing" effect?

/proc/PID/cmdline for an nfsd thread is empty. So it probably isn't
accessing 'cmdline' specifically that unblocks, but any (or almost any)
proc file for the process might help.
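
One way to test this would be to repeat the loop with a different
per-task file and see whether I/O still resumes. The following is only
a sketch: poke_tasks is a hypothetical helper name, and the optional
second argument (an alternative root directory) exists only so the
function can be tried against a scratch tree instead of the real /proc.

```shell
#!/bin/sh
# Sketch: read one named file for every task and report how many bytes
# it returned. poke_tasks is a hypothetical helper, not an existing tool.
poke_tasks() {
    file=$1               # per-task file to read, e.g. cmdline or comm
    root=${2:-/proc}      # assumption: override only for experiments
    for task in "$root"/[0-9]*/task/[0-9]*; do
        # Skip tasks that exited meanwhile (also covers an unmatched glob).
        [ -e "$task/$file" ] || continue
        # Count the bytes actually read; kernel threads return 0 for cmdline.
        size=$(( $(cat "$task/$file" 2>/dev/null | wc -c) ))
        echo "$task $file bytes=$size"
    done
}
```

Running "poke_tasks comm" while the system is frozen, and only then
"poke_tasks cmdline", would show whether cmdline is special or whether
any read of a per-task proc file has the same effect.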

You say that *after* accessing cmdline, the "stack" file shows a normal
stack trace. It might be interesting to see if that same stack is
present *before* accessing cmdline. But my guess is that nfsd is mostly
a distraction.
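
A rough sketch of that before/after comparison, assuming it is run as
root (reading the "stack" file normally requires that). compare_stack
is a hypothetical helper name; it takes one task directory such as
/proc/PID/task/TID.

```shell
#!/bin/sh
# Sketch: record a task's kernel stack before and after reading its
# cmdline, and report whether it changed. compare_stack is hypothetical.
compare_stack() {
    task=$1                                   # e.g. /proc/1234/task/1234
    before=$(cat "$task/stack" 2>/dev/null)   # stack before the access
    cat "$task/cmdline" >/dev/null 2>&1       # the access under suspicion
    after=$(cat "$task/stack" 2>/dev/null)    # stack after the access
    if [ "$before" = "$after" ]; then
        echo "$task: stack unchanged"
    else
        echo "$task: stack changed"
        printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
    fi
}
```

It could be driven by the same loop as in the original report, for
example: for task in /proc/*/task/*; do compare_stack "$task"; done.
Note that reading "stack" is itself a proc access to the task, so if
any proc file unblocks the system, the "before" read may already do so.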

It would help to see the full "echo t > /proc/sysrq-trigger" list of all
process stacks. That should reveal where the blockage is.
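
A minimal sketch of capturing that list, assuming a root shell and that
the kernel log buffer is large enough to hold a dump of every task (it
may not be on a system with many tasks, in which case the start of the
output is lost). dump_task_stacks and the default output file name are
hypothetical; the trigger path is a parameter only so the function can
be exercised without touching the real file.

```shell
#!/bin/sh
# Sketch: ask the kernel to log the stack of every task, then save the
# kernel log to a file. dump_task_stacks is a hypothetical helper.
dump_task_stacks() {
    trigger=${1:-/proc/sysrq-trigger}   # assumption: override for dry runs
    out=${2:-sysrq-t.txt}               # hypothetical output file name
    echo t > "$trigger" || return 1     # 't' requests the task-state dump
    sleep 1                             # give the kernel time to log it
    dmesg > "$out"                      # copy the kernel log buffer
}
```

If the frozen I/O affects the filesystem holding the output file, it
would make sense to point the second argument at a tmpfs path.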

NeilBrown


>
> I'm well aware, that, for example, a hardware problem might result in
> just anything and that the question might not be answerable at all.
> If so: please excuse the noise.
>
> Thanks
> Donald
> --
> Donald Buczek
> buczek@xxxxxxxxxxxxx
> Tel: +49 30 8413 1433
>
>