Re: uninterruptible sleep lockups

From: Horst von Brand
Date: Tue Feb 22 2005 - 22:43:31 EST


Chris Friesen <cfriesen@xxxxxxxxxx> said:
> Horst von Brand wrote:
> > Anthony DiSante <theant@xxxxxxxxxxxxxxx> said:

> >>That's one of the things I asked a few messages ago. Some people on
> >>the list were saying that it'd be "really hard" and would "require a
> >>lot of bookkeeping" to "fix" permanently-D-stated processes... which is
> >>completely different than "impossible."

> > Most people here have little clue. It can't be done.

> I realize it would be extremely difficult if not impossible to do in the
> current linux architecture, but I find it hard to believe that it is
> technically impossible if one were allowed to design the system from
> scratch.

It is hard (if not impossible) to find out /what/ is broken (and how) and
fix it automatically. As you were told, D means the process is waiting for
some event. That event /might/ happen sometime (waiting for slow hardware)
or never (kernel programming error, hardware forgot the operation in
progress, ...). So you might fake it out by pretending the event did
happen. But what if it was merely delayed, and /does/ then happen with
nobody waiting?

Any such scheme is just papering over the problem, and adds /massive/
complexity for no real gain.

> Maybe I'm on crack, but would it not be technically possible to have all
> resource usage be tracked so that when a task tries to do something and
> hangs, eventually it gets cleaned up?

Sure. But there is /no way/ to know whether the task will ever do anything
again (Turing undecidability sees to that, even with perfect hardware), so
the only option is to wait and see whether the task releases the resource
by itself. If you just want to axe the task, you'd have to know beforehand
what it would have done (and do it on its behalf when killing it). But if
the /task/ couldn't do it, what guarantees the cleanup code can?

> We already handle cleaning up stuff for userspace (memory, file
> descriptors, sockets, etc.).

On process end, i.e., when we know the stuff won't be used anymore. If the
program is stuck, kill it and carry on as before. If it doesn't go away
cleanly, something is /seriously/ wrong... and it is anybody's guess what.

> Why not enforce a design that says "all
> entities taking a lock must specify a maximum hold time".

It is hard enough to program the kernel without such restrictions. This
would incidentally also mean that the kernel has to be hard real time,
always. The usual PC hardware just isn't up to that, for starters.

And what would you do if you have nested locks and the outer one times
out? You'd have to unwind the inner one first... more complexity still.

> After that
> time expires, they are assumed to be hung, and all their resources
> (which were being tracked by some system) get cleaned up.

> It would probably be complicated, slow, and generally not worth the
> effort. But it seems at least technically possible.

If the system spends all its resources just on managing those same
resources, it is somewhat pointless...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/