Re: [PATCH RFC] x86: check for and defend against BIOS memory corruption

From: Jeremy Fitzhardinge
Date: Fri Aug 29 2008 - 03:21:39 EST


Ingo Molnar wrote:
> * Rafał Miłecki <zajec5@xxxxxxxxx> wrote:
>
>
>> 2008/8/28 Jeremy Fitzhardinge <jeremy@xxxxxxxx>:
>>
>>> Some BIOSes have been observed to corrupt memory in the low 64k. This
>>> patch does three things:
>>> - Reserves all memory which does not have to be in that area, to
>>> prevent it from being used as general memory by the kernel. Things
>>> like the SMP trampoline are still in the memory, however.
>>> - Clears the reserved memory so we can observe changes to it.
>>> - Adds a function check_for_bios_corruption() which checks and reports on
>>> memory becoming unexpectedly non-zero. Currently it's called in the
>>> x86 fault handler and the power management debug output.
>>>
>>> RFC: What other places should we check for corruption in?
>>>
>>> [ Alan, Rafał: could you check you see:
>>> 1: corruption messages
>>> 2: no crashes
>>> Thanks -J
>>> ]
>>>
>> I was trying my best to crash system with this patch applied and failed :)
>>
>> Works great.
>>
>> Just wonder if I should expect any printk from
>> check_for_bios_corruption? I do not see any:
>>
>> zajec@sony:~> dmesg | grep -i corr
>> scanning 2 areas for BIOS corruption
>>
>
> that's _very_ weird.
>

No, it's expected. Rafał only got corruption when plugging his HDMI
cable, and I didn't put any corruption checks on that path (I'm not even
sure what kernel code would get executed in that case). Hugh's original
patch put a check in the hot path of the fault handler - and so it would
get called regularly - but I put it in the kernel-bug path, which is
fairly pointless given that we expect this patch to prevent the crashes.

It does, however, do the check in the pm state changes, so doing a
suspend should make it print some of the corruption it found. Alan's
case would be a better test for that though.

It does raise the question of where the good places to put the check
are. It shouldn't be too hot, given that it's scanning ~64k of memory,
but often enough to actually show something. I was thinking of putting
some calls in the acpi code itself, but got, erm, discouraged.
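
To make the cost of a scan concrete, here is a minimal userspace sketch
of the kind of check being discussed: zero a ~64k region, then walk it
a word at a time and report anything that has become non-zero. It is
not the code from the patch. The names scan_buf, scan_area_init() and
scan_area_check() are hypothetical, the buffer is an ordinary array
standing in for the reserved low memory, and re-zeroing each bad word
after reporting it is an assumption made so that the same damage is not
reported on every call.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the reserved low-memory area (~64k). */
#define SCAN_BYTES (64 * 1024)
#define SCAN_WORDS (SCAN_BYTES / sizeof(unsigned long))

static unsigned long scan_buf[SCAN_WORDS];

/* Zero the region so that any later non-zero word counts as corruption. */
static void scan_area_init(void)
{
	memset(scan_buf, 0, sizeof(scan_buf));
}

/*
 * Walk the whole region one word at a time. Each non-zero word is
 * reported and then cleared again (an assumption of this sketch) so
 * that it is only reported once. Returns the number of bad words.
 */
static size_t scan_area_check(void)
{
	size_t i, bad = 0;

	for (i = 0; i < SCAN_WORDS; i++) {
		if (scan_buf[i] != 0) {
			printf("corrupted word at offset %zu: %#lx\n",
			       i * sizeof(scan_buf[0]), scan_buf[i]);
			scan_buf[i] = 0;
			bad++;
		}
	}
	return bad;
}
```

Every call touches the full region, which is why the call site should
be something that runs occasionally (a pm state change, a sysrq
handler) rather than on every fault.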

Maybe hooking into a sysrq key would be useful (sysrq-m?).

> maybe the BIOS expects _zeroes_ somewhere? Do you suddenly see crashes
> if you change this line in Jeremy's patch:
>
> + memset(__va(addr), 0, size);
>
> to something like:
>
> + memset(__va(addr), 0x55, size);
>
> If this does not tickle any messages either, then maybe the problem is
> in the identity of the entities we allocate in the first 64K. Is there a
> list of allocations that go there when Jeremy's patch is not applied?
>
> but ... i think with an earlier patch you saw corruption, right?
> Far-fetched idea: maybe it's some CPU erratum during suspend/resume that
> corrupts pagetables if the pagetables are allocated in the first 64K of
> RAM? In that case we should use a bootmem allocation for pagetables that
> give a minimum address of 64K.
>

Rafał's corruption was definitely non-zero. I think the corruption is
happening, but it's just not reported.
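
One note on the 0x55 experiment: since the check reports memory
becoming non-zero, changing only the memset() would presumably make
every byte look corrupted, so the check would have to compare against
the fill pattern as well. A minimal userspace sketch of that idea
follows. pattern_fill() and pattern_count_changed() are hypothetical
names, not functions from the patch, and the buffer is supplied by the
caller rather than being real low memory.

```c
#include <stddef.h>
#include <string.h>

/* Fill the watched area with a known byte pattern (e.g. 0x55). */
static void pattern_fill(unsigned char *buf, size_t size,
			 unsigned char pattern)
{
	memset(buf, pattern, size);
}

/*
 * Count the bytes that no longer match the fill pattern. Unlike a
 * plain "is it non-zero" test, this also catches something that
 * writes zeroes into the area.
 */
static size_t pattern_count_changed(const unsigned char *buf, size_t size,
				    unsigned char pattern)
{
	size_t i, changed = 0;

	for (i = 0; i < size; i++) {
		if (buf[i] != pattern)
			changed++;
	}
	return changed;
}
```

The fill and the check have to agree on the pattern, so in a real
version it would make sense to keep it in a single constant.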

J