RE: [PATCH] Time out for possible dead loops during queuedinvalidation wait

From: Yu, Fenghua
Date: Wed May 27 2009 - 20:47:55 EST


>> >> Which error code is better? Is EAGAIN ok?
>> >
>> >That depends on driver details - probably EIO would be suitable, dunno.
>> >
>> >But all the callers of qi_submit_sync() seem to just drop the error
>> >code on the floor:
>> >
>> > /* should never fail */
>> > qi_submit_sync(&desc, iommu);
>> >
>> >and may well cause a kernel crash as a result.
>>
>> Should the code go to kernel panic after timeout in qi_submit_sync()
>loops? When timeout (10 seconds) in the loops, something in hardware could
>be wrong.
>>
>
>The most important thing to do when an error is detected is to protect
>the user's data, perhaps by reliably halting the kernel.
>
>The second most important thing is to report what happened, so people
>can fix things.
>
>The third most important thing is to attempt to recover from the error
>so that the kernel continues to function. This third requirement often
>makes the second one more successful: a still-running kernel can report
>things which a crashed kernel cannot.
>
>In this particualr case I'd suggest that the driver be converted to
>correctly recognise errors, clean things up and propagate the errors
>back in an orderly fashion, as usual.
>
>
>otoh... How likely is it that this timeout actually occurs? If it's
>only conceivable that this can happen when the hardware is busted then
>I'd suggest that we not patch the kernel at all - the kernel really has
>little chance of surviving broken silicon so why bother adding a little
>bit of code here to handle a tiny subset of it?
>
>If the chip is indeed busted then afacit the kernel will get stuck in
>an infinite loop. That's OK, we can still diagnose those with NMI
>watchdogs, sysrq-p handlers, etc.

This timeout can occur either when the hardware is busted or when iommu kernel software or BIOS has a bug. The timeout detection with a suitable error code can help debug the hardware or software issues.

In some cases (e.g. interrupt migration in interrupt remapping), the timeout is not a blocking issue; it just impacts performance. If there is not the timeout detection, the cpu ends up in an infinite loop although system should be able to continue to run. But the timeout detection can dump warning info for user to debug and the system can continue to run.

Seems I will also need to change the upper level callers to correctly handle the error code (EIO) generated by qi_submit_sync().

Thanks.

-Fenghua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/