Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.

From: Borislav Petkov
Date: Fri Jul 11 2014 - 16:36:23 EST


On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
> > Basically the scheme becomes the following:
> >
> > * We switch to polling if we detect a second CMCI under an interval X
> > * We poll Y times, each polling with a duration Z.
> > * If during those Y*Z msec of polling, we've encountered errors, we
> > extend the polling period by an additional Y*Z msec.
> >
> >
> > check_interval will be capped on the low end to something bigger than
> > the polling duration Y*Z and only the storm detection code will be
> > allowed to go to lower intervals and switch to polling.
> >
> > At least something like that. In general, I'd like to make it more
> > robust for every system without the need for user interaction, i.e.
> > adjusting check_interval and where it just works.
>
> But at the same time, this scheme introduces even more variables that
> need careful tuning, e.g. storm polling interval and storm duration,
> while not really doing anything to make check_interval superfluous. Do

Oh, we can't make check_interval superfluous - it has been an API to
userspace for a long time now.

> you really think we can tune these variables correctly for every
> system out there?

Right, I was trying to figure out a scheme first where polling intervals
and thresholds would actually make sense and not be arbitrary.

We probably won't be able to have the exact values for each system but a
smart approximation could do the job nicely enough.
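
To make the scheme quoted at the top concrete, here is a rough, untested
sketch of the storm handling as a small state machine. Everything in it
is hypothetical and for illustration only: the struct, the function
names and the values picked for X, Y and Z are assumptions, not what the
MCE code does today.

```c
#include <stdbool.h>

/* Hypothetical values, picked only for illustration. */
#define STORM_WINDOW_MS	1000	/* X: a second CMCI within this window means storm */
#define STORM_POLLS	5	/* Y: number of polls per polling period */
#define STORM_POLL_MS	100	/* Z: duration of a single poll */

struct cmci_state {
	bool polling;		/* true while in storm (polling) mode */
	bool seen_cmci;		/* have we seen any CMCI yet? */
	long last_cmci_ms;	/* timestamp of the previous CMCI */
	int polls_left;		/* polls remaining in the current period */
	bool errors_seen;	/* any error found during this period? */
};

/* Called on each CMCI. Returns true if we just switched to polling. */
bool cmci_interrupt(struct cmci_state *s, long now_ms)
{
	bool storm = s->seen_cmci &&
		     (now_ms - s->last_cmci_ms) < STORM_WINDOW_MS;

	s->seen_cmci = true;
	s->last_cmci_ms = now_ms;

	if (storm && !s->polling) {
		s->polling = true;
		s->polls_left = STORM_POLLS;
		s->errors_seen = false;
		return true;
	}
	return false;
}

/*
 * Called once per poll (every STORM_POLL_MS) while in polling mode.
 * Returns true if we stay in polling mode afterwards.
 */
bool cmci_poll(struct cmci_state *s, bool found_error)
{
	if (!s->polling)
		return false;

	if (found_error)
		s->errors_seen = true;

	if (--s->polls_left > 0)
		return true;

	if (s->errors_seen) {
		/* Errors during the last Y*Z msec: poll for another Y*Z msec. */
		s->polls_left = STORM_POLLS;
		s->errors_seen = false;
		return true;
	}

	/* A quiet Y*Z msec: go back to interrupt mode. */
	s->polling = false;
	return false;
}
```

The point of the sketch is that X, Y and Z are the only knobs, and none
of them is derived from check_interval.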

> Or if we want to be generous: How about we just hardcode
> check_interval to 5 seconds. Would that be fine with everyone?

We could but again, it is an API to userspace exported through sysfs.

Besides, on a healthy system, you see errors so seldom that polling
every 5 seconds is a pure waste of energy.

> > I don't know whether any of the above makes sense - I hope that the
> > gist of it at least shows what I think we should be doing: instead
> > of letting users configure the check_interval and influence the CMCI
> > polling interval, we should rely purely on machine characteristics to
> > set minimum values under which we poll and above which, we do the normal
> > duration enlarging dance.
>
> I think the scheme may work, although I'm worried about the burstiness
> mentioned above.
>
> But I don't really buy that pulling a handful of numbers out of thin
> air and saying it should work for everyone is going to work.

No no, absolutely not. This is exactly what I think should be fixed as
the current numbers are likely pulled out of thin air. Simply because
figuring out the optimal ones is a very hard task, as we have come to
realize.

> Either we need solid data to back up those numbers, or we need to make
> them configurable so people can experiment and find what works best
> for them.

..., or, we could measure them on each system and approximate them to
the ones close to optimal for that particular system, over the course of
its runtime.
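
One hypothetical way to do that kind of runtime approximation, purely as
a sketch: keep a running average of the observed gap between errors and
derive the poll interval from it, clamped to sane bounds. The names, the
1/8 weight, the "poll twice per expected gap" rule and the bounds are
all assumptions made up for this example, not measured values.

```c
/* Hypothetical bounds, in milliseconds. */
#define POLL_MIN_MS	1000L
#define POLL_MAX_MS	300000L

struct poll_est {
	long avg_gap_ms;	/* running average gap between errors, 0 = no sample yet */
};

/* Fold a newly observed gap between two errors into the average (weight 1/8). */
void poll_est_update(struct poll_est *e, long gap_ms)
{
	if (e->avg_gap_ms == 0)
		e->avg_gap_ms = gap_ms;
	else
		e->avg_gap_ms += (gap_ms - e->avg_gap_ms) / 8;
}

/* Suggested poll interval: half the average gap, clamped to [min, max]. */
long poll_est_interval(const struct poll_est *e)
{
	long iv;

	/* No errors observed yet: poll as rarely as allowed. */
	if (e->avg_gap_ms == 0)
		return POLL_MAX_MS;

	iv = e->avg_gap_ms / 2;
	if (iv < POLL_MIN_MS)
		iv = POLL_MIN_MS;
	if (iv > POLL_MAX_MS)
		iv = POLL_MAX_MS;
	return iv;
}
```

A healthy box which never logs an error stays at the maximum interval,
while a box which keeps seeing errors drifts towards the minimum on its
own.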

Thanks for taking the time and humouring me with that crazy
brainstorming!

:-)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.