Re: [PATCH] Time out for possible dead loops during queuedinvalidation wait

From: Andrew Morton
Date: Wed May 27 2009 - 19:52:16 EST


On Wed, 27 May 2009 16:25:52 -0700
"Yu, Fenghua" <fenghua.yu@xxxxxxxxx> wrote:

> >> Which error code is better? Is EAGAIN ok?
> >
> >That depends on driver details - probably EIO would be suitable, dunno.
> >
> >But all the callers of qi_submit_sync() seem to just drop the error
> >code on the floor:
> >
> > /* should never fail */
> > qi_submit_sync(&desc, iommu);
> >
> >and may well cause a kernel crash as a result.
>
> Should the code go to kernel panic after timeout in qi_submit_sync() loops? When timeout (10 seconds) in the loops, something in hardware could be wrong.
>

The most important thing to do when an error is detected is to protect
the user's data, perhaps by reliably halting the kernel.

The second most important thing is to report what happened, so people
can fix things.

The third most important thing is to attempt to recover from the error
so that the kernel continues to function. This third requirement often
makes the second one more successful: a still-running kernel can report
things which a crashed kernel cannot.

In this particualr case I'd suggest that the driver be converted to
correctly recognise errors, clean things up and propagate the errors
back in an orderly fashion, as usual.


otoh... How likely is it that this timeout actually occurs? If it's
only conceivable that this can happen when the hardware is busted then
I'd suggest that we not patch the kernel at all - the kernel really has
little chance of surviving broken silicon so why bother adding a little
bit of code here to handle a tiny subset of it?

If the chip is indeed busted then afacit the kernel will get stuck in
an infinite loop. That's OK, we can still diagnose those with NMI
watchdogs, sysrq-p handlers, etc.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/