Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.

From: Havard Skinnemoen
Date: Fri Jul 11 2014 - 20:10:18 EST


On Fri, Jul 11, 2014 at 1:22 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
>> > * max number of CMCIs per second a system can sustain fine, i.e. the 100
>> > above
>>
>> What's the definition of "fine"? 1% performance hit? 10%? How can we
>> make that decision without knowing how hard the users are pushing
>> their systems?
>
> Those are definitely unchartered territories we're moving into so yes,
> answering that won't be easy.
>
> Let's try it: if the anount of time we spend per second in the CMCI
> handler becomes higher than the amount of time we spend polling for that
> same second, then we might just as well poll and save us the interrupt
> load.
>
> So, for example, say for 10ms poll rate and single poll duration of
> 2ms, if time spent in CMCI exceeds 200ms for that second, we switch to
> polling. Hypothetical numbers, of course.

200ms per second means we're using 20% of that CPU. I'd say that's
definitely too much. But I like the general approach.

> Can we measure it on every system? Probably not. So we need to
> approximate it somehow.
>
>>
>> > * total polling duration during storm, i.e. the 1 second above
>> >
>> > and if those are chosen generously for every system out there, then we
>> > don't need to dynamically adjust the polling interval.
>>
>> I'm not sure how we can be generous when there's a tradeoff involved.
>> If we make the interval "generously low", we end up hurting
>> performance. if we make it "generously high", we'll lose information.
>
> Yeah, by "generous" I meant, choose values which fit all. But I realize
> now that this is a dumb idea. Maybe we could measure it on each system,
> read the TSC on CMCI entry and exit and thus get an average CMCI
> duration...

Sounds interesting. Some things that may need some more thought:

1. What percentage of CPU is OK to use before we consider it a storm?

2. How do we map that number to polling mode, where we may not see all
the errors? If we get it wrong, we may end up bouncing at a very high
rate.

3. If we go for a fixed polling rate, how do we make sure it doesn't
require more CPU than what we determined in (1)?

Havard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/