Re: uninterruptible sleep lockups

From: Bodo Eggert
Date: Wed Feb 23 2005 - 21:08:37 EST


On Wed, 23 Feb 2005, linux-os wrote:
> On Wed, 23 Feb 2005, Bodo Eggert wrote:
> > linux-os <linux-os@xxxxxxxxxxxx> wrote:

> >> You don't seem to understand. A process that's stuck in 'D' state
> >> shows a SEVERE error, usually with a hardware driver.
> >
> > Or a network filesystem mount to a no longer existing server or share.
>
> But that's a whole different problem. That's a systemic problem
> of "fail-over". Network file-systems really need to interface
> with an intermediate virtual device that can isolate failed
> systems and make them look "perfect" to individual machines.
>
> If you don't do this, then as soon as somebody trips over a
> wire, your database is trashed. I'm surprised that NFS, PCNFS,
> SMB, etc., actually work as well as everybody seems to
> think they do. Until the architectural problem is resolved,
> there are still going to be hung processes, trashed databases,
> etc.

You don't run databases over a network filesystem unless you're begging
for trouble. For the other common purposes you'll usurally get a more
stable behaviour, since the failure on the client won't prevent the server
from properly writing the metadata or flushing the cache.

> > How to clean up the stuck processes: (This requires a MMU)
> > Add an error path to each syscall (or create some generic error paths) and
> > keep the original stack frame. On errors, you can "longjump" (not exactly,
> > but similar) to the error path after copying the memory. The semaphore will
> > not be taken, and the code depending on the semaphore will not be executed.
> >
>
> Again, you are attacking the symptom. The problem could be resolved
> by using a local disk (or a disk file) for the immediate I/O and
> the I/O to the file-servers could occur whenever they are available.

a) There are systems without local storage.

b) It won't help while stat()ing a non-cached object.

c) This would involve race conditions for e.g. two disconnected nodes on
reconnect. AFAI can see, this race can be solved by:
c1) The final transaction must be delayed until it's ACKed or
NACKed. This may delay the D-State for some seconds, but not enough.
c2) The server will have to keep track of the clients and need to be told
when a user left for a trip to the south pole without unmounting.
Very undesirable.
c3) Ignoring. Very, very undesireable.
c4) Requiring explicit transaction handling by the applications.
Interesting, but not in the near future.

d) This won't allow synchronous updates without falling back to classic
handling.

e) The users will update some files, get a positive reply and shut down
their PCs before the changes can be commited to the server. If the
server will not come back or the client is not rebooted within
reasonable time, this will cause silent data loss.

f) This will require reliable identification of the network server.

g) I'm not only thinking of NFS/..., allthough I used it as _the_ example.
E.g. if you see your IDE drive failing, you'll want to declare it dead
instead of waiting $num_of_sectors times five minutes until the kernel
decides to give up.


I agree that most D-states are problems that need to be fixed instead of
being worked-around, but sometimes you can't fix the problem without
access to the crystal-ball-device. Therefore all devices that can block
will need a manual override (with different probability), and the
processes that were stuck will need a way to recover or be stuck forever.

Obvoiusly the system is healthy enough to do some important and
uninterruptible work after those errors occured, so having them stuck will
be OK for now. Instead, the next task might be freeing the file
descriptors preventing you from unmounting your removable media or network
share or allowing really-forced umount.

--
Top 100 things you don't want the sysadmin to say:
54. Uh huh......"nu -k $USER".. no problem....sure thing...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/