Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.

From: Havard Skinnemoen
Date: Wed Jul 09 2014 - 17:24:38 EST


On Wed, Jul 9, 2014 at 12:17 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> On Wed, Jul 09, 2014 at 10:09:21AM -0700, Havard Skinnemoen wrote:
> > From: Ewout van Bekkum <ewout@xxxxxxxxxx>
> >
> > The CMCI poll interval was updated to pick the minimum interval between
> > the original 30 seconds and the check_interval divided by 8 (minimum of
> > 3 polls).
>
> Why min 3 polls? How do you come up with exactly that frequency?

The idea is that if we make the CMCI poll interval equal to
check_interval, the handler might bounce back and forth between
polling and interrupt mode a lot. So we need to divide by something,
and 8 seems like a nice, safe value, and it seems to work well. We're
not opposed to considering other values, of course (e.g. 2 and 4 might
work well too, but with a somewhat higher risk of ping-ponging).
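
To make the arithmetic concrete, here is a rough userspace sketch of
the interval selection and of why the old fixed threshold could never
be reached. It is not the patch itself: the helper names are made up
for illustration, the units are whole seconds rather than jiffies, and
treating "reaching" the threshold as enough to leave storm mode is an
assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

#define DEFAULT_CMCI_POLL_SECS 30UL /* the "original 30 seconds" */
#define POLL_DIVISOR           8UL  /* divisor discussed in this thread */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper: pick the CMCI poll interval as the smaller of
 * the default 30 seconds and check_interval / 8.
 */
unsigned long cmci_poll_interval_secs(unsigned long check_interval_secs)
{
	return min_ul(DEFAULT_CMCI_POLL_SECS,
		      check_interval_secs / POLL_DIVISOR);
}

/*
 * Toy model of a quiet period: each quiet poll doubles the timer
 * interval, capped at check_interval. Returns true if the interval
 * can ever reach the threshold needed to leave storm mode.
 */
bool storm_can_subside(unsigned long start_secs,
		       unsigned long check_interval_secs,
		       unsigned long threshold_secs)
{
	unsigned long iv = start_secs;

	assert(start_secs > 0); /* doubling from 0 would never progress */

	for (;;) {
		if (iv >= threshold_secs)
			return true;  /* exit condition reachable */
		if (iv >= check_interval_secs)
			return false; /* capped below the threshold: stuck */
		/* Double without overflow, capped at check_interval. */
		iv = (iv > check_interval_secs / 2) ? check_interval_secs
						    : iv * 2;
	}
}
```

With check_interval = 10 and a fixed 30 second threshold, the interval
tops out at 10 and the model stays in polling mode forever. With the
threshold derived from check_interval / 8, the exit condition becomes
reachable again, and for large check_interval values the threshold is
still the original 30 seconds.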

> > This resolves a bug where the CMCI storm handler is unable to return to
> > interrupt mode from polling mode, if the check_interval is shorter than the
> > CMCI poll interval. This problem is caused by the mce_timer_fn function
> > which only allows the poll interval to be incremented up to the
> > check_interval, while the mce_intel_adjust_timer function requires the
> > poll interval to be greater than the CMCI poll interval before leaving
> > the CMCI_STORM_ACTIVE state.
>
> Interesting. So it seems you guys want to set the check_interval to
> something < 30 secs.
>
> Out of curiosity, what is your use case which requires such small
> check_interval setting?

I'm not entirely sure. At some point, it ended up that way, and it
broke in non-obvious ways, so we wanted to fix it.

> Maybe we need to redesign and simplify this intervals thing to make it
> more user-friendly...
>
> Btw, on a related note, we're working on a small mechanism which
> collects correctable errors in the kernel and when a certain count for a
> physical error address has been reached, we soft-offline that page. We'd
> appreciate it if you guys took a look and told us whether it makes sense
> to you:
>
> http://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp@xxxxxxxxx

We will definitely take a look, thanks. Looks interesting, though it's
not always obvious what works for us until we actually go and try it.

Havard