Re: [PATCH v2] x86/kexec: Exclude GART aperture from vmcore

From: Baoquan He
Date: Tue Jan 09 2018 - 01:19:45 EST


On 01/06/18 at 02:00am, Jiri Bohac wrote:
> Hi Baoquan,
>
> thank you very much for your review!
>
> On Wed, Dec 27, 2017 at 03:44:49PM +0800, Baoquan He wrote:
> > > > 1) If 'iommu=off' is specified in 1st kernel but not in kdump kernel, it
> > > > will ignore the ram we need dump.
> > >
> > > yes, instead of crashing the machine (because GART may be initialized in the
> > > 2nd kernel, overlapping the 1st kernel memory, which the 2nd kernel with its
> > > fake e820 map sees as unused).
> > >
> > > I'd say this is an improvement.
> >
> > I don't get what you said. If 'iommu=off' only specified in 1st kernel,
> > kdump kernel will think the memory which GART bar pointed as a hole.
> > This is incorrect. I don't see the improvement.
>
> Without the patch, the kernel will crash/hang the machine, as it
> will (correctly) try to dump the first kernel's memory that is
> now mapped by the GART.
>
> With the patch it will produce a dump with the data missing.
> But this is just a corner case, I don't see why someone would
> specify iommu=off in the first kernel and not the second.
>
> So instead of crashing and producing no dump at all, we get a
> dump with some data possibly missing.
>
> > I understand you could get a bug report from other people, and have to
> > fix it as an assignee. And this fix is located in aperture_64.c only,
> > I am fine it's done like this. Maybe you can try the way I suggested
> > that only removing the region from io resource, but not touching anything
> > else, if you have interest.
> >
> > So if have to, could you add some code comments around your fix to notice
> > people why these code are introduced? Commit log can help to understand
> > added code, while sometime file moving may make this checking very hard.
>
> Sure, the full patch with comments added is below. Do the
> comments make sense? Do you want me to re-post it as v3?:
>
> ---
> On machines where the GART aperture is mapped over physical RAM
> /proc/vmcore contains the remapped range and reading it may
> cause hangs or reboots.
>
> In the past, the GART region was added into the resource map, implemented by
> commit 56dd669a138c ("[PATCH] Insert GART region into resource map")
>
> However, inserting the iomem_resource from the early GART code caused
> resource conflicts with some AGP drivers (bko#72201), which got avoided by
> reverting the patch in commit 707d4eefbdb3 ("Revert [PATCH] Insert GART
> region into resource map"). This revert introduced the /proc/vmcore bug.
>
> The vmcore ELF header is either prepared by the kernel (when using the
> kexec_file_load syscall) or by the kexec userspace (when using the kexec_load
> syscall). Since we no longer have the GART iomem resource, the userspace
> kexec has no way of knowing which region to exclude from the ELF header.
>
> Changes from v1 of this patch:
> Instead of excluding the aperture from the ELF header, this patch
> makes /proc/vmcore return zeroes in the second kernel when attempting to
> read the aperture region. This is done by reusing the
> gart_oldmem_pfn_is_ram infrastructure originally intended to exclude XEN
> balooned memory. This works for both, the kexec_file_load and kexec_load
> syscalls.
>
> [Note that the GART region is the same in the first and second kernels:
> regardless whether the first kernel fixed up the northbridge/bios setting
> and mapped the aperture over physical memory, the second kernel finds the
> northbridge properly configured by the first kernel and the aperture
> never overlaps with e820 memory because the second kernel has a fake e820
> map created from the crashkernel memory regions. Thus, the second kernel
> keeps the aperture address/size as configured by the first kernel.]
>
> register_oldmem_pfn_is_ram can only register one callback and returns an error
> if the callback has been registered already. Since XEN used to be the only user
> of this function, it never checks the return value. Now that we have more than
> one user, I added a WARN_ON just in case agp, XEN, or any other future user of
> register_oldmem_pfn_is_ram were to step on each other's toes.
>
>
> Signed-off-by: Jiri Bohac <jbohac@xxxxxxx>
> Fixes: 707d4eefbdb3 ("Revert [PATCH] Insert GART region into resource map")

I am fine. There's no git branch for kdump since changes related
usually scatter in all components. For this one, you can post a new one
and send to Joerg since he maintains iommu code. And Andrew, he usually
helps to pick kdump kernel changes to his akpm tree.

Thanks
Baoquan

>
> diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
> index f5d92bc3b884..2c4d5ece7456 100644
> --- a/arch/x86/kernel/aperture_64.c
> +++ b/arch/x86/kernel/aperture_64.c
> @@ -30,6 +30,7 @@
> #include <asm/dma.h>
> #include <asm/amd_nb.h>
> #include <asm/x86_init.h>
> +#include <linux/crash_dump.h>
>
> /*
> * Using 512M as goal, in case kexec will load kernel_big
> @@ -56,6 +57,33 @@ int fallback_aper_force __initdata;
>
> int fix_aperture __initdata = 1;
>
> +#ifdef CONFIG_PROC_VMCORE
> +/*
> + * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
> + * use the same range because it will remain configured in the northbridge.
> + * Trying to dump this area via /proc/vmcore may crash the machine, so exclude
> + * it from vmcore.
> + */
> +static unsigned long aperture_pfn_start, aperture_page_count;
> +
> +static int gart_oldmem_pfn_is_ram(unsigned long pfn)
> +{
> + return likely((pfn < aperture_pfn_start) ||
> + (pfn >= aperture_pfn_start + aperture_page_count));
> +}
> +
> +static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
> +{
> + aperture_pfn_start = aper_base >> PAGE_SHIFT;
> + aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
> + WARN_ON(register_oldmem_pfn_is_ram(&gart_oldmem_pfn_is_ram));
> +}
> +#else
> +static void exclude_from_vmcore(u64 aper_base, u32 aper_order)
> +{
> +}
> +#endif
> +
> /* This code runs before the PCI subsystem is initialized, so just
> access the northbridge directly. */
>
> @@ -435,8 +463,16 @@ int __init gart_iommu_hole_init(void)
>
> out:
> if (!fix && !fallback_aper_force) {
> - if (last_aper_base)
> + if (last_aper_base) {
> + /*
> + * If this is the kdump kernel, the first kernel
> + * may have allocated the range over its e820 RAM
> + * and fixed up the northbridge
> + */
> + exclude_from_vmcore(last_aper_base, last_aper_order);
> +
> return 1;
> + }
> return 0;
> }
>
> @@ -473,6 +509,14 @@ int __init gart_iommu_hole_init(void)
> return 0;
> }
>
> + /*
> + * If this is the kdump kernel _and_ the first kernel did not
> + * configure the aperture in the northbridge, this range may
> + * overlap with the first kernel's memory. We can't access the
> + * range through vmcore even though it should be part of the dump.
> + */
> + exclude_from_vmcore(aper_alloc, aper_order);
> +
> /* Fix up the north bridges */
> for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
> int bus, dev_base, dev_limit;
> diff --git a/arch/x86/xen/mmu_hvm.c b/arch/x86/xen/mmu_hvm.c
> index 2cfcfe4f6b2a..dd2ad82eee80 100644
> --- a/arch/x86/xen/mmu_hvm.c
> +++ b/arch/x86/xen/mmu_hvm.c
> @@ -75,6 +75,6 @@ void __init xen_hvm_init_mmu_ops(void)
> if (is_pagetable_dying_supported())
> pv_mmu_ops.exit_mmap = xen_hvm_exit_mmap;
> #ifdef CONFIG_PROC_VMCORE
> - register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram);
> + WARN_ON(register_oldmem_pfn_is_ram(&xen_oldmem_pfn_is_ram));
> #endif
> }
>
> --
> Jiri Bohac <jbohac@xxxxxxx>
> SUSE Labs, Prague, Czechia
>