Boot failure due to: x86/boot: Save fields explicitly, zero out everything else

From: John Hubbard
Date: Wed Aug 21 2019 - 14:20:37 EST


On 8/21/19 10:05 AM, Neil MacLeod wrote:
Hi John

The following bug might be of interest:

https://bugzilla.kernel.org/show_bug.cgi?id=204645

Let me know if I can be of any help.


Hi Neil,

First of all, I'm pasting in the bug information so that it's directly available
here:

===================================================================
Description: Neil MacLeod 2019-08-21 16:29:19 UTC
System: Intel i5 Skylake NUC (NUC6i5SYH)

This system boots fine from internal M2 (128GB) drive with 5.3-rc4.

With 5.3-rc5, it does not boot from M2 and is stuck on the Intel splash screen (no other text is displayed, no panic etc.). It will boot 5.3-rc5 from a USB flash memory stick (via the F10 boot menu), but not from the internal M2.

Bisecting between 5.3-rc4 and 5.3-rc5, the bad commit is:

neil@nm-linux:~/projects/pullrequest_repos/torvalds-linux$ git bisect bad
a90118c445cc7f07781de26a9684d4ec58bfcfd1 is the first bad commit
commit a90118c445cc7f07781de26a9684d4ec58bfcfd1
Author: John Hubbard <jhubbard@xxxxxxxxxx>
Date: Tue Jul 30 22:46:27 2019 -0700

x86/boot: Save fields explicitly, zero out everything else

Recent gcc compilers (gcc 9.1) generate warnings about an out of bounds
memset, if the memset goes accross several fields of a struct. This
generated a couple of warnings on x86_64 builds in sanitize_boot_params().

Fix this by explicitly saving the fields in struct boot_params
that are intended to be preserved, and zeroing all the rest.

[ tglx: Tagged for stable as it breaks the warning free build there as well ]

Suggested-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Suggested-by: H. Peter Anvin <hpa@xxxxxxxxx>
Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@xxxxxxxxxx

:040000 040000 e0963edca990540dd759765a3d765af4698df892 d07e645eb3a500c31bd65526205e286ff6941187 M arch

Comment 1 Neil MacLeod 2019-08-21 16:31:35 UTC
The kernel is built with gcc-9.2.0.

Comment 2 Neil MacLeod 2019-08-21 16:55:48 UTC

5.3-rc5 with "x86/boot: Save fields explicitly, zero out everything else" reverted will build with gcc-9.2.0, and boot from M2.
===================================================================

I'm also CC'ing the lists, so they know that the patch has caused a regression, and
also out of hope that they can help me choose the shortest path forward to debugging
this. My first reaction is that the list of BOOT_PARAM_PRESERVE() fields is
flawed--by which I include cases of "the system improperly relied on a field that the spec said
should be zeroed". (After all, the boot_params->sentinel is set, which already means
the system is not really doing it right.)

So I'm going to cheat and ask right now if anyone notices either ommissions
or wrong entries here:

static void sanitize_boot_params(struct boot_params *boot_params)
{
...
const struct boot_params_to_save to_save[] = {
BOOT_PARAM_PRESERVE(screen_info),
BOOT_PARAM_PRESERVE(apm_bios_info),
BOOT_PARAM_PRESERVE(tboot_addr),
BOOT_PARAM_PRESERVE(ist_info),
BOOT_PARAM_PRESERVE(acpi_rsdp_addr),
BOOT_PARAM_PRESERVE(hd0_info),
BOOT_PARAM_PRESERVE(hd1_info),
BOOT_PARAM_PRESERVE(sys_desc_table),
BOOT_PARAM_PRESERVE(olpc_ofw_header),
BOOT_PARAM_PRESERVE(efi_info),
BOOT_PARAM_PRESERVE(alt_mem_k),
BOOT_PARAM_PRESERVE(scratch),
BOOT_PARAM_PRESERVE(e820_entries),
BOOT_PARAM_PRESERVE(eddbuf_entries),
BOOT_PARAM_PRESERVE(edd_mbr_sig_buf_entries),
BOOT_PARAM_PRESERVE(edd_mbr_sig_buffer),
BOOT_PARAM_PRESERVE(e820_table),
BOOT_PARAM_PRESERVE(eddbuf),
};

If not, then I think we need to bisect by...well, let's start with the
theory that we zeroed out too much, so we could start adding more fields
to preserve. If that doesn't find the problem, then it's more complicated,
and might be better to go the other direction: starting without my commit,
and zeroing out fields until we see the failure.

Are you able to test patches? It would take some time, since there are quite a
few fields. Alternately, if you provide some more system information details
(especially if we have any other notes about other failures and passes), then
I might be able to borrow a Skylake system and attempt a repro.


thanks,
--
John Hubbard
NVIDIA