Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

From: Borislav Petkov
Date: Thu Mar 23 2017 - 11:22:57 EST


On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> Lemme try to write a small script exercising exactly that scenario to
> see whether I'm actually not talking crap here :-)

Ok, here's a snapshot from the CEC (array dump at the end of this mail)
after letting it run for a couple of hours in a guest, with two instances
of a script running in parallel and injecting random PFNs. We have 0
offlined pages because no single PFN repeats frequently enough to
overflow its count.

When I force a single PFN to occur 1023 or more times, and do that more
than once, this happens:

[ 6629.091239] RAS: Soft-offlining pfn: 0x7fff
[ 6629.093036] __get_any_page: 0x7fff free buddy page
[ 6653.259476] RAS: Soft-offlining pfn: 0x7fff
[ 6653.260100] soft offline: 0x7fff page already poisoned

...

Stats:
CEs: 32614
offlined pages: 2
^^^^^^^^^^^^^^^^^

Flags: 0x0
Timer interval: 86400 seconds
Decays: 254
Action threshold: 1023

The "already poisoned" thing shouldn't happen in real life because once
the page frame is poisoned, it shouldn't generate MCEs.
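
The threshold behaviour described above can be illustrated with a toy
model. This is only a sketch under assumptions, not the CEC code: the
class name ToyCollector and its methods are hypothetical, the counter is
assumed to be dropped once the page is offlined, and decay and the
512-element array limit are ignored. Only the threshold value (1023) and
the PFN (0x7fff) come from the output above.

```python
from collections import Counter

ACTION_THRESHOLD = 1023  # "Action threshold" from the stats above


class ToyCollector:
    """Per-PFN error counter; hypothetical model, not the kernel code."""

    def __init__(self, threshold=ACTION_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.ces = 0        # total corrected errors seen
        self.offlined = 0   # number of offlining attempts triggered

    def report(self, pfn):
        """Record one corrected error; return True if the PFN got offlined."""
        self.ces += 1
        self.counts[pfn] += 1
        if self.counts[pfn] < self.threshold:
            return False
        # Assumption: the entry is removed once the threshold is reached.
        del self.counts[pfn]
        self.offlined += 1
        return True


# Force the same PFN past the threshold twice, as in the forced test.
cec = ToyCollector()
for _ in range(2 * ACTION_THRESHOLD):
    cec.report(0x7fff)
print(cec.ces, cec.offlined)  # prints "2046 2"
```

Feeding the same model with PFNs that each show up only a handful of
times never reaches the threshold, which matches the "offlined pages: 0"
result of the random-injection run.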




Every 2.0s: head -n 40 array; tail -n 40 array Thu Mar 23 17:15:15 2017

{ n: 512
000: [0000000000000056|c01]
001: [000000000000011f|801]
002: [0000000000000171|401]
003: [00000000000001ce|401]
004: [000000000000024a|401]
005: [000000000000026e|401]
006: [000000000000034d|c01]
007: [0000000000000395|c01]
008: [00000000000003b9|801]
009: [0000000000000458|003]
010: [000000000000045c|401]
011: [00000000000004f9|401]
012: [00000000000005d1|c01]
013: [0000000000000677|801]
014: [000000000000069d|401]
015: [00000000000006b3|401]
016: [00000000000006f5|c01]
017: [00000000000006fc|401]
018: [000000000000074d|401]
019: [0000000000000764|c01]
020: [00000000000008a8|801]
021: [0000000000000951|401]
022: [0000000000000994|401]
023: [0000000000000aa8|401]
024: [0000000000000ac7|801]
025: [0000000000000af2|801]
026: [0000000000000bb5|801]
027: [0000000000000bd5|401]
028: [0000000000000be0|c01]
029: [0000000000000c30|c01]
030: [0000000000000c61|801]
031: [0000000000000c8a|401]
032: [0000000000000d0d|801]
033: [0000000000000d2a|003]
034: [0000000000000d4d|401]
035: [0000000000000d87|c01]
036: [0000000000000da4|c01]
037: [0000000000000e06|401]
038: [0000000000000e23|c01]

...

480: [0000000000007d22|005]
481: [0000000000007d5f|002]
482: [0000000000007d9f|004]
483: [0000000000007db1|c01]
484: [0000000000007dbf|002]
485: [0000000000007dcf|002]
486: [0000000000007dd8|401]
487: [0000000000007df0|001]
488: [0000000000007df4|002]
489: [0000000000007e1f|003]
490: [0000000000007e35|801]
491: [0000000000007e73|003]
492: [0000000000007e77|401]
493: [0000000000007e80|002]
494: [0000000000007e9c|002]
495: [0000000000007eac|002]
496: [0000000000007ecb|002]
497: [0000000000007ed8|801]
498: [0000000000007edc|003]
499: [0000000000007ee3|801]
500: [0000000000007f05|004]
501: [0000000000007f15|002]
502: [0000000000007f51|004]
503: [0000000000007f5e|003]
504: [0000000000007f80|801]
505: [0000000000007f92|003]
506: [0000000000007fb2|002]
507: [0000000000007fd9|002]
508: [0000000000007fdf|002]
509: [0000000000007fe5|004]
510: [0000000000007ff4|801]
511: [0000000000007ffa|001]
}
Stats:
CEs: 30074
offlined pages: 0
Flags: 0x0
Timer interval: 86400 seconds
Decays: 234
Action threshold: 1023
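
For reading the dump above, a small decoder sketch may help. The field
layout is an assumption inferred from the output, not something this
mail states: the 3-hex-digit field after the "|" is taken to hold a
10-bit count in its low bits (2^10 - 1 = 1023 matches the action
threshold) and 2 decay bits above that. The names decode_entry, ENTRY_RE,
COUNT_BITS and DECAY_BITS are made up for this example.

```python
import re

COUNT_BITS = 10   # assumed: 2**10 - 1 == 1023, the action threshold
DECAY_BITS = 2    # assumed: the remaining bits of the 3-hex-digit field

# Matches dump lines such as "009: [0000000000000458|003]".
ENTRY_RE = re.compile(r"(\d+):\s*\[([0-9a-f]{16})\|([0-9a-f]{3})\]")


def decode_entry(line):
    """Split one dump line into (index, pfn, decay, count)."""
    m = ENTRY_RE.search(line)
    if m is None:
        raise ValueError(f"not an array entry: {line!r}")
    index = int(m.group(1))
    pfn = int(m.group(2), 16)
    low = int(m.group(3), 16)
    count = low & ((1 << COUNT_BITS) - 1)
    decay = (low >> COUNT_BITS) & ((1 << DECAY_BITS) - 1)
    return index, pfn, decay, count


print(decode_entry("000: [0000000000000056|c01]"))  # (0, 86, 3, 1)
print(decode_entry("009: [0000000000000458|003]"))  # (9, 1112, 0, 3)
```

Under that reading, entries like c01, 801 and 401 are PFNs seen once
with different decay values, while 003 or 005 are PFNs seen a few times
with decay 0; none of them is anywhere near a count of 1023.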

--
Regards/Gruss,
Boris.
