Re: [RFC] IPMI state machine regression

From: Corey Minyard
Date: Thu Aug 23 2018 - 12:23:05 EST


On 08/22/2018 11:23 AM, Andrew Banman wrote:
> On Wed, Aug 22, 2018 at 11:14:52AM -0500, Corey Minyard wrote:
>> On 08/21/2018 05:14 PM, Andrew Banman wrote:
>>> Dear IPMI supporters,
>>>
>>> We observe a window in IPMI BT's opportunistic get capabilities request,
>>> wherein GET_DEVICE_GUID and GET_DEVICE_ID requests may start while the BT state
>>> machine is in WR_CONSUME. Following this, the 0xD5 error code is forced in
>>> bt_start_transaction, IPMI fails to initialize, and the interface is torn down.
>>> There is no mechanism to retry bringing up the interface in open() /dev/ipmi.
>>> This leaves IPMI hosed until you reload modules. Looks to happen after we call
>>> schedule().
>> When was the latest kernel where this worked properly? Also, what hardware
>> is this?
> This is UV4.
>
> First known bad commit, but I am not sure if the timing issue predates
> it:
>
> commit aa9c9ab2443e3b9562c6c7cfc245a9e43b557d14
> Author: Jeremy Kerr <jk@xxxxxxxxxx>
> Date: Fri Aug 25 15:47:24 2017 +0800
>
> ipmi: allow dynamic BMC version information
>
> Hits less frequently with older kernels so I didn't see it until
> recently when it became more frequent.

Ok, that's for the crash, which makes sense. But that's an easy problem to fix.
I would like a "Tested-by" on that, if you get to test it, though I was able to
simulate various failures there to test it out.

So reading between the lines ("more frequent") I'm guessing that this still
happened with older kernels, but is becoming annoying with newer kernels.
I would guess recent changes cause it to happen more often because of changes
in the way the upper layer interacts with the lower layers: there are more
messages at startup, and the timing is somewhat different.

The BT code itself hasn't changed much in over 10 years. Nothing that
looks like it would cause an issue like this. So I would guess this is an
issue that has been around for a while.

I don't have any real hardware with a BT interface, just the one in qemu,
but I've never seen it there.

It actually looks like the state machine is working ok. But the BMC is
responding to a "Get Device ID" command with:

Recv:: 1c 08 d5


That's an error response with completion code D5, which is "Cannot execute
command. Command, or request parameter(s), not supported in present state."
That particular command shouldn't ever respond with that error, so I think
the bug here is with your BMC.
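
For anyone reading the trace, here is a minimal sketch of how the three bytes
in the "Recv" line can be decoded. It assumes the usual IPMI message layout
(byte 0 is NetFn shifted left by two bits ORed with the LUN, byte 1 is the
command, byte 2 is the completion code). The function name decode_response and
the two lookup tables are hypothetical helpers for illustration, not part of
the driver, and the tables are deliberately tiny. Note that in the IPMI
specification's App command table 0x08 is Get Device GUID and 0x01 is Get
Device ID, so this particular line may belong to the GUID request.

```python
# Hypothetical helper for illustration only; not kernel code.
# Assumed layout: byte 0 = (NetFn << 2) | LUN, byte 1 = command,
# byte 2 = completion code (present in responses).

# Small excerpts of the IPMI App command and completion code tables.
APP_COMMANDS = {0x01: "Get Device ID", 0x08: "Get Device GUID"}
COMPLETION_CODES = {
    0x00: "Command completed normally",
    0xD5: ("Cannot execute command. Command, or request parameter(s), "
           "not supported in present state."),
}


def decode_response(hex_bytes):
    """Decode a 'Recv' trace such as '1c 08 d5' into its fields."""
    data = bytes.fromhex(hex_bytes)  # spaces between byte pairs are accepted
    if len(data) < 3:
        raise ValueError("need NetFn/LUN, command, and completion code")
    netfn = data[0] >> 2      # upper six bits
    lun = data[0] & 0x03      # lower two bits
    is_app = netfn in (0x06, 0x07)  # App request / App response
    return {
        "netfn": netfn,
        "lun": lun,
        "is_response": bool(netfn & 1),  # odd NetFn values are responses
        "cmd": data[1],
        "cmd_name": APP_COMMANDS.get(data[1], "unknown") if is_app else "unknown",
        "completion_code": data[2],
        "meaning": COMPLETION_CODES.get(data[2], "unknown"),
    }


if __name__ == "__main__":
    for key, value in decode_response("1c 08 d5").items():
        print(f"{key}: {value}")
```

Run on "1c 08 d5", this reports NetFn 0x07 (App response), LUN 0, command
0x08, and completion code 0xD5.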

-corey