Re: [PATCH 1/2 v6] x86/kexec_file: add e820 entry in case e820 type string matches to io resource name

From: lijiang
Date: Wed Nov 21 2018 - 05:54:27 EST


On 2018-11-18 19:52, Borislav Petkov wrote:
> On Fri, Nov 16, 2018 at 11:25:55AM +0800, lijiang wrote:
>> For the pci mmconfig issue, it should be good enough that the e820 reserved region
>> [mem 0x0000000078000000-0x000000008fffffff] is only passed to the second kernel, but
>> the pci mmconfig region is not the same in another machine.
>
> Yes. And now the question is, *which* reserved regions need to be mapped
> for the second kernel to function properly? How do we figure that out?
>
>> A simple case, hotplug a pci network card and use the ssh/nfs to dump the vmcore.
>> If the pci mmconfig region is not reserved in kdump kernel, the pci hotplug device
>> could not be recognized. So the pci network card won't work.
>
> Yes that's a good example; put *that* example in your commit message.
>
>> Here, there is an example about SME kdump. Maybe it can help to better understand.
>
> You keep pasting that and I've read it already. And you keep repeating
> that the reserved regions need to be mapped in the second kernel and I'm
> asking, how do we determine *which* regions should we pass to the second
> kernel?
>

The pci mmconfig issue is clear from Dave's and Bjorn's comments, so I'd like
to explain two things:

1. Why the SME kdump kernel does not work without the reserved ranges.

The first kernel gets the e820 table from the BIOS or the bootloader, but the kdump
kernel cannot obtain it that way. As you probably know, with the kexec_load syscall it is
kexec-tools that passes the e820 ranges to the second kernel.

Unfortunately, the kernel does not pass the e820 reserved ranges to the kdump kernel when
the kexec_file_load syscall is used to load the kernel image and initramfs.

At the early boot stage, several callers use early_memremap() to remap address ranges,
for example those belonging to DMI, ACPI and firmware.

Here, I use just one example, DMI, to illustrate the problem seen in the kdump kernel.

At the early boot stage, the DMI code uses early_memremap() to remap the address range
[0xF0000-0xFFFFF] and then checks for the SMBIOS entry point signature.

When the SMBIOS entry point signature check succeeds, the code uses early_memremap()
again to remap another reserved range, [0x6286B000-0x6286EFFF], whose address is reported
by the firmware (SMBIOS).

Please refer to the code: drivers/firmware/dmi_scan.c

dmi_scan_machine()-> / dmi_early_remap()->early_memremap()
                     \ dmi_smbios3_present()->dmi_walk_early()->dmi_early_remap()->early_memremap()
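
To show why two separate ranges get remapped, here is a small user-space sketch of
the lookup (not the kernel code). It assumes a 64-bit SMBIOS 3 entry point: the anchor
string "_SM3_" on a 16-byte boundary, with the table address stored as a little-endian
64-bit value at offset 16 of the entry point. The checksum and length validation that
the kernel performs is omitted, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SCAN_STEP	16	/* entry points sit on 16-byte boundaries */
#define SM3_ADDR_OFF	16	/* offset of the 64-bit table address */
#define SM3_MIN_LEN	24	/* anchor .. end of the table address field */

/* Read a little-endian 64-bit value without alignment assumptions. */
uint64_t sketch_read_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Step 1: scan the already-mapped region (a copy of 0xF0000-0xFFFFF here)
 * for the "_SM3_" anchor.
 * Step 2: the address stored in the entry point names a second physical
 * range, which the caller then has to remap as well.
 * Returns 1 and fills *table_addr on success, 0 if no anchor was found.
 */
int sketch_find_smbios3_table(const uint8_t *region, size_t len,
			      uint64_t *table_addr)
{
	for (size_t off = 0; off + SM3_MIN_LEN <= len; off += SCAN_STEP) {
		if (memcmp(region + off, "_SM3_", 5) == 0) {
			*table_addr = sketch_read_le64(region + off + SM3_ADDR_OFF);
			return 1;
		}
	}
	return 0;
}
```

On my machine the address found in the second step is 0x6286B000, which is why both
ranges matter for the kdump kernel.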

The DMI ranges [0xF0000-0xFFFFF] and [0x6286B000-0x6286EFFF] are not in the e820 table
that the kdump kernel receives. Therefore, when SME is enabled, these regions are considered
to be encrypted by the following code in arch/x86/mm/ioremap.c. In fact, these regions are
still decrypted in the kdump kernel, so the mapping is wrong.

Note:
When SME is enabled and a memory region is actually decrypted, but the caller maps its pages
as encrypted, the content read from those pages is unpredictable.

early_memremap()->
    early_memremap_pgprot_adjust()->
        memremap_should_map_decrypted()->
            e820__get_entry_type()

static bool memremap_should_map_decrypted(resource_size_t phys_addr,
					  unsigned long size)
{
	int is_pmem;

	......

	/* Check if the address is outside kernel usable area */
	switch (e820__get_entry_type(phys_addr, phys_addr + size - 1)) {
	case E820_TYPE_RESERVED:
	case E820_TYPE_ACPI:
	case E820_TYPE_NVS:
	case E820_TYPE_UNUSABLE:
		/* For SEV, these areas are encrypted */
		if (sev_active())
			break;
		/* Fallthrough */

	case E820_TYPE_PRAM:
		return true;
	default:
		break;
	}

	return false;
}

ACPI, firmware and other early users run into the same problem, and eventually the system
fails to boot. That is why the SME kdump kernel does not work without the reserved ranges.
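
The effect can be reproduced with a small user-space model (a sketch only, not the
kernel code). The type names, the table layouts and the lookup helper below are
simplified assumptions; only the two DMI ranges are taken from my machine. The point
is that the same physical range gets a different answer depending on whether the
e820 table contains a reserved entry for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's e820 types. */
enum model_e820_type {
	MODEL_NONE = 0,		/* range not covered by any entry */
	MODEL_RAM = 1,
	MODEL_RESERVED = 2,
	MODEL_ACPI = 3,
	MODEL_NVS = 4,
};

struct model_e820_entry {
	uint64_t addr;
	uint64_t size;
	enum model_e820_type type;
};

/* Type of the single entry that fully covers [start, end], else MODEL_NONE. */
enum model_e820_type model_get_entry_type(const struct model_e820_entry *tbl,
					  size_t n, uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t last = tbl[i].addr + tbl[i].size - 1;

		if (start >= tbl[i].addr && end <= last)
			return tbl[i].type;
	}
	return MODEL_NONE;
}

/* Mirrors only the e820 switch above, for the SME (not SEV) case. */
bool model_should_map_decrypted(const struct model_e820_entry *tbl, size_t n,
				uint64_t phys_addr, uint64_t size)
{
	switch (model_get_entry_type(tbl, n, phys_addr, phys_addr + size - 1)) {
	case MODEL_RESERVED:
	case MODEL_ACPI:
	case MODEL_NVS:
		return true;
	default:
		return false;
	}
}

/* Hypothetical first-kernel table: the DMI ranges are marked reserved. */
const struct model_e820_entry model_first_kernel[] = {
	{ 0x0,        0x9F000,    MODEL_RAM },
	{ 0xF0000,    0x10000,    MODEL_RESERVED },
	{ 0x100000,   0x6276B000, MODEL_RAM },
	{ 0x6286B000, 0x4000,     MODEL_RESERVED },
};

/* Hypothetical kdump table built without the reserved ranges. */
const struct model_e820_entry model_kdump[] = {
	{ 0x1000,     0x9E000,    MODEL_RAM },
	{ 0x2B000000, 0x10000000, MODEL_RAM },	/* assumed crashkernel area */
};

void model_self_check(void)
{
	/* First kernel: both DMI ranges are mapped decrypted. */
	assert(model_should_map_decrypted(model_first_kernel, 4, 0xF0000, 0x10000));
	assert(model_should_map_decrypted(model_first_kernel, 4, 0x6286B000, 0x4000));
	/* kdump kernel: no matching entry, so they are treated as encrypted. */
	assert(!model_should_map_decrypted(model_kdump, 2, 0xF0000, 0x10000));
	assert(!model_should_map_decrypted(model_kdump, 2, 0x6286B000, 0x4000));
}
```

Calling model_self_check() passes silently: with the reserved entries present the DMI
ranges are mapped decrypted, and without them the default case is taken.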

2. Why all the reserved ranges are passed to the second kernel (or: which regions should we
pass to the second kernel?)

As mentioned above, if the reserved regions [0xF0000-0xFFFFF] and [0x6286B000-0x6286EFFF]
are passed to the second kernel, DMI works well on my machine. But how should we deal with
ACPI, firmware and the other early users?

I think we would have to find every place that calls early_memremap(), determine whether the
reserved region it maps needs to be passed to the second kernel, and then add kdump-specific
code to each of those code paths. That is really too expensive.

Furthermore, I only did in the kernel the same thing that kexec-tools already does: pass all
the reserved ranges to the second kernel.
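
As a rough user-space illustration of that idea (not the patch itself), the sketch below
walks a list of I/O resources and adds an e820 reserved entry for every resource whose
name matches. The structure names, the helper name, the "Reserved" name string and the
resource list are assumptions made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_E820		16
#define SKETCH_TYPE_RESERVED	2

/* Hypothetical, flattened view of one I/O resource. */
struct sketch_resource {
	uint64_t start;
	uint64_t end;		/* inclusive */
	const char *name;
};

struct sketch_e820_entry {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/* Hypothetical e820 table handed to the second kernel. */
struct sketch_boot_e820 {
	struct sketch_e820_entry entries[SKETCH_MAX_E820];
	size_t nr;
};

/*
 * Add one reserved e820 entry per resource named "Reserved".
 * Returns the number of entries added; stops when the table is full.
 */
size_t sketch_add_reserved(const struct sketch_resource *res, size_t n,
			   struct sketch_boot_e820 *out)
{
	size_t added = 0;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(res[i].name, "Reserved") != 0)
			continue;
		if (out->nr >= SKETCH_MAX_E820)
			break;
		out->entries[out->nr].addr = res[i].start;
		out->entries[out->nr].size = res[i].end - res[i].start + 1;
		out->entries[out->nr].type = SKETCH_TYPE_RESERVED;
		out->nr++;
		added++;
	}
	return added;
}
```

With this approach no caller of early_memremap() needs to be inspected one by one; every
reserved range known to the first kernel simply shows up in the second kernel's table.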

If you already understand this issue, please ignore this explanation.

Thanks.
Lianbo

> If we should pass *all* reserved regions, why?
>
> IOW, I'm looking for the *why* first.
>
> Thx.
>